1.  CAD  INFRASTRUCTURE  AND  TOOLS


1.1.  Massively  Parallel  Algorithms  for  Three-Dimensional  Device  Simulation  (A. 
Sangiovanni-Vincentelli) 

Recent advances in processing technology create the need to model physical
phenomena described in terms of three-dimensional geometries. For instance, submicron
technology for MOS devices requires accurate modeling of narrow-channel effects which, in turn,
imply the use of 3-D discretization. Three-dimensional effects must also be modeled in advanced
bipolar structures. However, including these new effects yields an enormous increase in CPU
time if the present 2-D algorithms are extended to cover the three-dimensional case on a
conventional processor. Large vector supercomputers have been used to simulate this class of
problems: in [1] a three-dimensional device simulator was developed based on a seven-point
finite-difference discretization, and the biconjugate gradient method was used to solve the
non-symmetric linear system arising from that discretization. The vectorization provided a
speed-up of sixteen for typical problems. Since supercomputers offer such a limited degree of
parallelism, other approaches are under investigation to solve very large simulations
efficiently. Recently, massively parallel algorithms have been used for linear capacitance
evaluation in three-dimensional structures, showing good computational performance [2]. In this
work we present a full 3-D device simulator developed on a Connection Machine [3]. The CM is a
massively parallel SIMD computer with up to 65,536 bit-serial processors. Each processor has
64 Kbits of memory. Communication is either over a fixed distance on an N-dimensional grid, or
direct to an arbitrary processor based on a hypercube.

In this paper we will extend the techniques presented in [2] to device simulation.

The simulator consists of a Poisson solver and a general drift-diffusion equation solver.
Poisson's equation is discretized on a three-dimensional finite-difference grid with non-uniform
spacing, which has been mapped onto the CM architecture by allocating a processor to each grid
node. The non-linear equations are solved by Newton iteration. The linear system which
has to be inverted at each non-linear iteration is symmetric, positive definite and diagonally
dominant. These properties guarantee that the Incomplete Cholesky Conjugate Gradient method
converges to the solution. A red/black partitioning [2] of the unknowns has been performed to
achieve high parallelism. The parallelism of this method has been enhanced by exploiting the
processors which are idle during the preconditioning phase. More specifically, defining the
Jacobian matrix $B = [LL^T]_{rb} + E_{rb}$, where $[LL^T]_{rb}$ is the incomplete factorization
of $B$ for a proper r/b ordering, we also compute the preconditioning for the associated b/r
ordering. It is possible to show that this technique reduces the maximum value of the entries of
the error matrix $E$. This preconditioning does not require any additional overhead, because the
two solutions can be evaluated using the idle processors during the preconditioning phase.
Table 1 shows the improvement in iteration count for the new method compared with the standard
one. The test structure is a 3-D MOS capacitor on a uniform substrate (ffa ^ 10'^) and the
number of grid nodes was 3120.
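The red/black partitioning mentioned above can be illustrated with a minimal sketch. It assumes the usual seven-point stencil on a rectangular grid and colors each node by the parity of its index sum; the function names are hypothetical and this is not the simulator's code. The point is that no two nodes of the same color are coupled, so all nodes of one color can be updated in parallel.

```python
from itertools import product

def color(i, j, k):
    """Return 0 (red) or 1 (black) from the parity of the grid indices."""
    return (i + j + k) % 2

def neighbors(i, j, k, shape):
    """Yield the in-bounds neighbors of a node in the seven-point stencil."""
    nx, ny, nz = shape
    offsets = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    for di, dj, dk in offsets:
        a, b, c = i + di, j + dj, k + dk
        if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
            yield a, b, c

def same_color_couplings(shape):
    """Count stencil couplings that join two nodes of the same color."""
    count = 0
    for i, j, k in product(*(range(n) for n in shape)):
        for nb in neighbors(i, j, k, shape):
            if color(*nb) == color(i, j, k):
                count += 1
    return count
```

Because every stencil neighbor differs by one in exactly one index, the count is zero for any grid shape, which is what makes the two-color update order parallel.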

When the current flow is not negligible, the solution of the full set of drift-diffusion
equations is required. The Scharfetter-Gummel approach [4] is used to discretize the continuity
equations, the coupled Newton method to solve the non-linear system of algebraic equations, and
the Biconjugate Gradient method [1] to invert the Jacobian matrix. Since there are now three
variables for each grid node, as opposed to one for Poisson's equation, we use a two-level
matrix, where the higher level accounts for the interactions among adjacent nodes, while the
(3x3) lower-level matrices represent the interactions among the variables associated with
the same node. Following this approach, the Biconjugate Gradient method has been implemented by
replacing each arithmetic operation with its matrix counterpart. The preconditioning scheme has
been formulated using a red/black partitioning of the high-level matrix. While alternative
preconditioning approaches are now under evaluation, it is useful to consider the performance of
the overall method for a simple diode biased in the high-injection regime. The diode has
= 10"“ (cm^-3), the whole domain is a cube 3 µm on a side, and the n-doping profile is
modeled as a 1 µm cube. The number of iterations required by the Newton method to reach
convergence and the global number of iterations required by the BCG technique are shown in
Table 2 as a function of the applied voltage and of the number of grid nodes. The variables are
initialized by a linear extrapolation from the two previous solutions. An important measure of
the accuracy of the solution is the conservation of the current even at very low levels. We have
simulated the same diode in the reverse region (1 V), evaluating the currents generated in the
depleted region. Current conservation was achieved up to 5 digits for a current density of the
order of 10"'* A/µm^2 using a relative convergence tolerance of IC^* for the Newton method and
Id"* for the BCG. Our preliminary measurements lead us to expect a performance of 700 Mflops on
a fully configured Connection Machine.
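As an illustration of the Scharfetter-Gummel discretization named above, the following sketch evaluates the flux between two neighboring nodes with the Bernoulli function. It is a textbook one-dimensional form under assumed conventions (potential drop normalized to the thermal voltage, electron flux, hypothetical function names), not the simulator's implementation.

```python
import math

def bernoulli(x):
    """B(x) = x / (exp(x) - 1), with a first-order series near x = 0."""
    if abs(x) < 1e-8:
        return 1.0 - 0.5 * x
    return x / math.expm1(x)

def sg_electron_flux(n_left, n_right, dpsi, h, diffusivity=1.0):
    """Scharfetter-Gummel electron flux across a mesh edge of length h.

    dpsi is (psi_right - psi_left) divided by the thermal voltage
    (an assumed normalization for this sketch).
    """
    return diffusivity / h * (n_right * bernoulli(dpsi) - n_left * bernoulli(-dpsi))
```

Two sanity checks follow from the formula: with zero potential drop the flux reduces to plain diffusion, and for a Boltzmann-distributed density (n_right = n_left * exp(dpsi)) the flux vanishes.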


Bias Step    Non-lin. It.    Std.    New

   0.0            -           202    155
   0.1            3            63     47
   0.2            3            55     45
   0.3            3            60     47

Table 1: Comparison of the standard iteration with the new one.
1.2.  Almost  Periodic  Fourier  Transform

Many circuits of interest have nonlinear elements and have inputs of incommensurable
frequencies. Simulating such systems with harmonic balance requires computation of an Almost
Periodic Fourier Transform (APFT).

The APFT can be described as follows. We are given a set of n fundamental frequencies and
2n + 1 times. Given the amplitudes and phases of the fundamentals, it is easy to compute the
function value at the times; this transform is called $F^{-1}$. Conversely, given the function
values at the 2n + 1 times, we can construct the amplitudes and phases; this transform is $F$.
Both $F$ and $F^{-1}$ are linear maps. The difficulty is that if the times are not carefully
chosen, the matrices representing $F$ and $F^{-1}$ are extremely ill-conditioned, resulting in
large numerical errors.

The standard Fourier transform is a special case in which the frequencies are evenly spaced.
In this case using evenly spaced timepoints gives a matrix whose condition number is O(n).
Choosing evenly spaced timepoints in the APFT case typically gives condition numbers on the
order of n!.
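The contrast between the periodic and almost-periodic cases can be seen in a small experiment. The sketch below builds the sampling matrix with one row per timepoint and uses the largest normalized inner product between two columns as a cheap proxy for conditioning (a true condition number would need a singular value decomposition). The frequencies, the choice of proxy, and all names are assumptions made for illustration.

```python
import math

def sample_matrix(freqs, times):
    """One row per time: [1, cos(w1 t), sin(w1 t), ..., cos(wn t), sin(wn t)]."""
    return [[1.0] + [f(w * t) for w in freqs for f in (math.cos, math.sin)]
            for t in times]

def max_column_coherence(matrix):
    """Largest |cosine| of the angle between two distinct columns."""
    cols = list(zip(*matrix))
    norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
    worst = 0.0
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            dot = sum(x * y for x, y in zip(cols[a], cols[b]))
            worst = max(worst, abs(dot) / (norms[a] * norms[b]))
    return worst

n = 2
# 2n + 1 evenly spaced timepoints over one period of the lowest frequency
times = [2.0 * math.pi * j / (2 * n + 1) for j in range(2 * n + 1)]

# harmonically related frequencies: columns come out orthogonal
periodic = max_column_coherence(sample_matrix([1.0, 2.0], times))
# incommensurable frequencies at the same timepoints: columns are not orthogonal
almost_periodic = max_column_coherence(sample_matrix([1.0, math.sqrt(2.0)], times))
```

In the periodic case the columns are orthogonal up to rounding, while for the incommensurable pair they are visibly correlated, which is the symptom that motivates choosing timepoints carefully.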

The APFT problem then is the selection of a good set of timepoints. In previous work we
found a good heuristic solution. We chose a larger number of timepoints, say 2(2n + 1). Each
timepoint determines a row of $F^{-1}$. We iteratively add to the set of 'chosen' timepoints the
unchosen one whose row is most nearly orthogonal to the rows corresponding to the chosen
timepoints; the algorithm is similar to Gram-Schmidt orthogonalization. In practice this yields
an $F^{-1}$ whose rows are 'nearly orthogonal' in a well-defined sense, from which we prove that
$F$ and $F^{-1}$ are well-conditioned.
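The selection heuristic described above can be sketched as a pivoted Gram-Schmidt sweep over candidate rows. This is one plausible reading of the description, not the authors' code: the random choice of candidates, the example frequencies, and the function names are all assumptions.

```python
import math
import random

def row_for_time(freqs, t):
    """One row of the sampling matrix: the basis functions evaluated at time t."""
    return [1.0] + [f(w * t) for w in freqs for f in (math.cos, math.sin)]

def select_timepoints(freqs, candidates):
    """Greedily pick 2n + 1 candidate times whose rows are most nearly orthogonal."""
    needed = 2 * len(freqs) + 1
    # residual[i] is row i with its components along the chosen rows removed
    residual = {i: row_for_time(freqs, t) for i, t in enumerate(candidates)}
    chosen = []
    for _ in range(needed):
        # all rows start with equal norm, so the largest residual marks the
        # row most nearly orthogonal to the span of the rows chosen so far
        best = max(residual, key=lambda i: sum(x * x for x in residual[i]))
        pivot = residual.pop(best)
        norm_sq = sum(x * x for x in pivot)
        chosen.append(best)
        for i in list(residual):
            scale = sum(x * y for x, y in zip(residual[i], pivot)) / norm_sq
            residual[i] = [x - scale * y for x, y in zip(residual[i], pivot)]
    return [candidates[i] for i in chosen]

rng = random.Random(0)
freqs = [1.0, math.sqrt(2.0), math.pi]  # hypothetical incommensurable set
n = len(freqs)
candidates = [rng.uniform(0.0, 50.0) for _ in range(2 * (2 * n + 1))]
selected = select_timepoints(freqs, candidates)
```

The sketch would divide by zero if the candidate rows did not span the full space; a production version would need to detect that case and draw more candidates.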

We are currently seeking an algorithm which is deterministic (uses no randomness),
efficient,  and  constructs  a  set  of  2n+l  timepoints  yielding  an  F  whose  condition  number  is  less 
than  a  specified  bound. 

One piece of this puzzle is to compute an 'almost period' of the system: a value T whose
residue modulo each of the given periods is smaller than a given limit ε. We have shown how to
do this by applying the Lenstra, Lenstra, and Lovasz (LLL) basis reduction algorithm. The
algorithm efficiently constructs a short (nearly shortest) vector in a lattice described by an
arbitrary set of basis vectors; it may be thought of as a generalization of the continued
fraction algorithm.
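For just two periods the almost-period problem reduces to the continued fraction case mentioned above, which can be sketched with the Python standard library. This is a deliberately simplified stand-in for the LLL-based method (which handles more than two periods); the function names and the bound on the multiplier are assumptions.

```python
import math
from fractions import Fraction

def almost_period(period_a, period_b, max_multiple=1000):
    """Return T = q * period_a that lies close to an integer multiple of period_b."""
    # limit_denominator returns the closest fraction p / q with q <= max_multiple,
    # so q * period_a is approximately p * period_b
    ratio = Fraction(period_a / period_b).limit_denominator(max_multiple)
    return ratio.denominator * period_a

def residue(t, period):
    """Distance from t to the nearest integer multiple of period."""
    return abs(t - round(t / period) * period)

T = almost_period(math.sqrt(2.0), 1.0)
```

For periods sqrt(2) and 1 with multipliers up to 1000, the value found is 985 times sqrt(2), which lies within about 4e-4 of the integer 1393.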

Ideally, we would like to choose a set of timepoints which gives a matrix nearly
identical to the matrix representing the standard Fourier transform. We suppose that the
frequencies are described in a way which includes the linear relations between them. For
instance, a set of frequencies could be specified as $\{f_1, f_2, f_3, \ldots\}$, with the
remaining entries written as integer combinations of $f_1$, $f_2$, and $f_3$. If there is a map
of the 'independent' frequencies ($f_1$, $f_2$, and $f_3$ in the example above) into the set
$\{0, 1, \ldots, n-1\}$ such that the full set of n frequencies maps 1-1 onto that set, then
Kronecker's theorem proves that there is a time $T_1$ at which the phase angles of the
fundamental frequencies are (to within an arbitrary degree of accuracy) the values
$0, 2\pi/n, 4\pi/n, \ldots$. In that case the row of $F^{-1}$ determined by $T_1$ is
(approximately) a row of the standard Fourier transform matrix. Furthermore, if the full set
of times is equally spaced, i.e., $T_1, 2T_1, \ldots, (2n+1)T_1$, then the APFT matrix
$F^{-1}$ is (approximately) the Fourier matrix. In this case it is not only well-conditioned,
but furthermore its inverse $F$, which is the matrix that we actually need, is (approximately)
known in advance. There are well-established numerical techniques to then compute the exact
value of $F$ from the constructed $F^{-1}$.

This  leaves  us  with  the  two  fundamental  problems  on  which  we  are  now  focusing. 

First: does a 1-1 map from the set of frequencies to $\{0, \ldots, n-1\}$ exist? If so, how
do we find it? And if not, what do we do next?

Second: how do we find the value $T_1$ whose existence is guaranteed by Kronecker's
theorem?

The first question appears to be essentially combinatorial in nature, and we do not currently
have a good approach to solving it.

The second question may be related to the problem of finding an almost period, which we
have solved as described above. For example, in some cases using $T_1 = T/(2n+1)$ works. In
other cases it is possible to reapply the LLL algorithm to compute an acceptable $T_1$. Thus
there is some promise to this technique, though we do not know how to solve the general case.


1.3.  Finite-Time  Theory  of  Simulated  Annealing  on  Special  Energy  Landscapes  (A.
Sangiovanni-Vincentelli)

The  best  theoretical  results  on  simulated  annealing  apply  to  arbitrary  spaces,  including 
those for NP-hard problems, and are necessarily somewhat weak. In practice, the 'cooling
schedules' used in annealing are more rapid than those analyzed by the theory, and the results
obtained  are  much  better  than  theory  would  predict.  The  strength  of  the  practical  results,  com¬ 
bined  with  the  fact  that  it  is  easy  to  construct  problems  on  which  annealing  will  not  work  well, 
means  that  problems  encountered  in  practice  have  special  properties  of  which  annealing  takes 
advantage.  Thus  we  take  a  two-pronged  approach:  studying  properties  of  energy  landscapes  of 
real-world  problems;  and  analyzing  the  behavior  of  annealing  on  landscapes  with  given  proper¬ 
ties. 

We  have  observed  many  fractal  properties  of  combinatorial  problems,  focusing  on  place¬ 
ment  problems.  In  particular,  the  sequence  of  energies  observed  over  time  is  a  Brownian  or  frac¬ 
tional  Brownian  motion  (fBm).  We  have  proved  that  a  random  walk  on  a  fractal  landscape  pro¬ 
duces  such  fBm  ‘energy  trajectories’,  and  we  conjecture  that  this  mechanism  is  responsible. 
Also,  direct  sampling  of  points  in  the  landscape  shows  an  approximate  power-law  relation 
between  distance  and  energy  difference.  Such  a  relation  is  a  natural  extension  of  the  definition  of 
fractalness  in  Euclidean  space  (the  landscapes  of  interest  do  not  lie  in  Euclidean  space). 

Most recently we have studied a class of mathematically defined deterministic self-affine
fractal energy landscapes on the reals. In this limited domain rigorous results can be derived
using the tools of rapidly mixing Markov chains.

In  particular  it  is  proved  that  using  a  geometric  cooling  schedule,  the  expected  energy 
difference  between  the  state  found  by  annealing  and  a  true  global  minimum  decreases  approxi¬ 
mately  as  a  (negative)  power  of  the  total  time  spent  annealing.  The  power  itself  depends  on 
parameters  of  the  problem  instance. 
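A geometric cooling schedule multiplies the temperature by a fixed ratio at every step. The following minimal sketch applies it to a rugged one-dimensional test function; the landscape, step size, and parameter values are hypothetical choices for illustration and are not the self-affine landscapes analyzed above.

```python
import math
import random

def energy(x):
    """Hypothetical rugged 1-D landscape: a truncated Weierstrass-style sum."""
    return sum(0.5 ** j * abs(math.sin(3.0 ** j * x)) for j in range(6))

def anneal(steps, t_start=1.0, ratio=0.999, step_size=0.1, seed=0):
    """Return the lowest energy seen while annealing with geometric cooling."""
    rng = random.Random(seed)
    x = rng.uniform(0.0, 10.0)
    e = energy(x)
    best = e
    temperature = t_start
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy(candidate)
        # Metropolis rule: always accept downhill, sometimes accept uphill
        if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
            x, e = candidate, e_new
            best = min(best, e)
        temperature *= ratio  # geometric cooling: T_k = t_start * ratio**k
    return best
```

Calling `anneal(0)` returns the energy of the starting point, so comparing it with `anneal(2000)` for the same seed shows how much the run improved on its start.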

In this 1-dimensional case, random sampling obeys a power law like that holding for
annealing, and random sampling may even be faster (depending on the problem instance). But
for higher-dimensional fractals the speed of annealing is basically unchanged while that of
random sampling decreases dramatically. The combinatorial spaces of practical interest do not
have a well-defined 'dimension' but do have many characteristics of very high-dimensional
spaces, making these results particularly relevant.

These are dramatic new results. Previous formal studies of annealing have required that the
time spent be exponential in the problem size, which is unrealistic. In this case the time is
power-law in the 'quality' of result desired, and polynomial in D (for a D-dimensional fractal).
Depending on whether range-limiting is employed, the time may be polynomial or exponential in
the log size of one 'side' of the D-cube.

In short, the theoretical analysis performed yields strong results for the restricted (but we
believe relevant) class to which it applies. It does so using a methodology which reflects what
we think to be the way annealing functions in practice.

We hope to extend these results to random fractals, which will require developing a
completely new set of mathematical tools. We also plan to use the theoretical analysis to guide
further examination of practical problems, and ultimately to construct efficient annealing
schedules and estimate their performance.


1.4.  Implementation  of  the  DFT-based  Quasi-Periodic  Steady  State  Analysis  in  Spectre  (A. 
Sangiovanni-Vincentelli) 

Spectre [5] is a harmonic balance simulator for analog and microwave circuits. It analyzes
nonlinear analog circuits using the harmonic balance technique. Linear devices can be evaluated
directly in the frequency domain. Nonlinear devices are in general impossible to evaluate in the
frequency domain directly; therefore, the excitation spectrum of a nonlinear device is first
transformed into the time domain. The response is then evaluated in the time domain and
transformed back to the frequency domain. Currently, Spectre uses Discrete Fourier Transforms,
and hence Fast Fourier Transforms, to convert periodic signals between the time and frequency
domains. For quasiperiodic signals, a special transform (the Almost Periodic FT [6,7]) is used
to perform the conversions.

When the nonlinearities in the devices are algebraic, the coefficients of the sinusoids are
frequency independent. Thus, for the purposes of evaluating the nonlinear devices, the actual
fundamental frequencies are of no importance and can be chosen freely. In particular, the
fundamental frequencies can be chosen to be multiples of some arbitrary frequency so that the
resulting signals will be periodic, and the DFT can be used. These artificially chosen
fundamental frequencies are not actually used in the harmonic balance calculations. They are
used only to determine in which order to place the terms in the spectra. In other words, we are
only interested in the correspondence (mapping) between quasiperiodic and periodic harmonic
indices.

In order to make computation involving quasiperiodic signals tractable, we need to truncate
the frequencies into a finite set. The box and diamond truncations are two popular truncation
methods. With the box truncation, only the first H harmonics of each fundamental frequency are
considered. The diamond truncation limits the sum of the absolute values of the indices of the
fundamental frequencies to be less than or equal to H. Currently Spectre only considers two
fundamental frequencies.

For a set of frequencies coming from a box truncation,

$\{\omega \mid \omega = k_1\lambda_1 + k_2\lambda_2;\ 0 \le k_1 \le H_1,\ |k_2| \le H_2,\ k_1 \ne 0 \text{ if } k_2 < 0\}$,

the correspondence between quasiperiodic and periodic harmonic indices for this particular
truncation method is

$k = (2H_2 + 1)\,k_1 + k_2$    (1)

For a set of frequencies obtained with a "diamond" truncation,

$\{\omega \mid \omega = k_1\lambda_1 + k_2\lambda_2;\ |k_1| + |k_2| \le H,\ k_1 + k_2 \ge 0,\ k_1 \ne -k_2 \text{ if } k_2 > 0\}$,

$k = (H + 1)\,k_1 + H\,k_2$    (2)

is the mapping equation.
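The two index mappings can be checked by direct enumeration. The sketch below lists the index pairs of each truncation and applies mapping (1) or (2). The half-plane tie-break for the diamond case is taken here as k1 != -k2 when k2 > 0, which is an assumption about the intended constraint (it is the choice that keeps every mapped index non-negative); the function names are hypothetical.

```python
def box_indices(h1, h2):
    """Index pairs (k1, k2) kept by a box truncation."""
    return [(k1, k2)
            for k1 in range(0, h1 + 1)
            for k2 in range(-h2, h2 + 1)
            if not (k1 == 0 and k2 < 0)]

def box_map(k1, k2, h2):
    """Mapping (1): periodic index for a box truncation."""
    return (2 * h2 + 1) * k1 + k2

def diamond_indices(h):
    """Index pairs (k1, k2) kept by a diamond truncation."""
    return [(k1, k2)
            for k1 in range(-h, h + 1)
            for k2 in range(-h, h + 1)
            if abs(k1) + abs(k2) <= h
            and k1 + k2 >= 0
            and not (k2 > 0 and k1 == -k2)]

def diamond_map(k1, k2, h):
    """Mapping (2): periodic index for a diamond truncation."""
    return (h + 1) * k1 + h * k2

box_k = sorted(box_map(k1, k2, 1) for k1, k2 in box_indices(1, 1))
diamond_k = sorted(diamond_map(k1, k2, 2) for k1, k2 in diamond_indices(2))
```

For H1 = H2 = 1 the box mapping produces the indices 0 through 4, and for H = 2 the diamond mapping produces 0 through 6, so in both cases the quasiperiodic terms land on distinct, contiguous periodic harmonics.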

Users can specify a truncation scheme through three parameters, $H_1$, $H_2$, and $H$, where

$|k_1| \le H_1, \quad |k_2| \le H_2, \quad |k_1| + |k_2| \le H$

These constraints result in a combination of box and diamond truncation. In order to use
the above equations to map from k1, k2 to k, we first expand the set of frequencies
to its nearest box or diamond truncation, whichever is smaller. Next, we use either equation (1)
or (2) to map the quasiperiodic harmonic indices k1 and k2 into their corresponding periodic
indices k, and construct the spectra. In order to use the FFT, the size of the spectra has to be
a power of 2. Therefore, we sometimes need to extend the size of the spectra further to meet
this requirement.
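The power-of-2 requirement amounts to rounding the spectrum length up. A minimal sketch follows; it assumes the extra entries are filled with zeros, which the text does not state, and both function names are hypothetical.

```python
def next_power_of_two(size):
    """Smallest power of two that is greater than or equal to size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return 1 << (size - 1).bit_length()

def pad_spectrum(spectrum):
    """Extend a spectrum with zeros (an assumed fill value) to a power-of-2 length."""
    target = next_power_of_two(len(spectrum))
    return list(spectrum) + [0.0] * (target - len(spectrum))
```

For example, a spectrum with 7 entries would be extended to 8, and one with 9 entries to 16.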

The DFT-based spectral analysis has been implemented in Spectre, using the mapping
technique described above. A few simple benchmark circuits were used to test the implementation.
A sample circuit which contains a simple polynomial conductor shows the following result:

polynomial  conductor./  =  +  2v*  +  v  6 


No. of harmonics         FFT            APFT
(diamond truncation)     Time (sec)     Time (sec)

         7                 0.43           29.52
         6                 0.443          12.73
         5                 0.15            4.95
         4                 0.17            1.58
         3                30.05            0.58
         2                 0.03            0.42

The  DFT  (FFT)  based  spectral  analysis  is  faster  compared  to  the  APFT  based  spectral 
analysis.  The  amount  of  speed  up  increases  as  the  number  of  harmonics  increases. 
1.5.  Generation  of  Analytical  Models  for  Layout  Interconnects  (A.  Sangiovanni-Vincentelli)

As the scale of integration and the speed requirements of integrated circuits increase,
accurate modelling of both on-chip and off-chip interconnects becomes necessary. As the widths
of interconnection lines are scaled down, the thickness is either kept constant or scaled by a
much smaller factor to limit line resistance. As a result, the fringing and sidewall effects
dominate the interconnect capacitances, causing gross estimation errors if the parallel-plate
approximation is used to compute line-to-ground and crossover capacitances. Also, shrinking line
widths and an increasing ratio of thickness to separation between lines tremendously increase
the coupling capacitance between two adjacent parallel lines as a fraction of the total
capacitance of each line. This makes estimation of coupling capacitances between adjacent lines
indispensable for today's VLSI circuits.

Since the time constants associated with interconnects scale by a much smaller factor than
those of devices, they are becoming increasingly significant in determining the speed of digital
systems. At high speeds of operation, some of the interconnects act like transmission lines, and
matching considerations become important in the design process. Moreover, when very aggressive
design rules are employed, stray coupling can cause the logic state of a line to change due to
switching of adjacent lines. In analog and mixed analog-digital circuits, performance can be
extremely sensitive to interconnect parameters. All these factors combined have
made interconnect modelling an active area of research.

Analytical expressions are available for modelling some simple configurations. However,
one may encounter very complicated configurations in the layout for which closed-form
expressions are not known. Numerical methods based on finite-difference,
finite-element or integral-equation techniques can be employed to extract the desired electrical
parameters (lumped or distributed). These methods solve Maxwell's equations in some form, and
usually incur very high computational cost. Hence, there is a need for analytical expressions
which can be used by CAD tools for fast extraction of large layouts. Such expressions will also
provide useful insight to circuit and layout designers.

An experimental program has been developed which can generate analytical models for
capacitances (line-to-ground and line-to-line) in some specific configurations by performing a
series of accurate numerical simulations. The configurations currently considered are (a) single
line, (b) adjacent parallel lines, and (c) crossing lines. The models are being used in the
routing of analog circuits.
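One simple way to turn a series of numerical simulations into an analytical model is a least-squares fit. The sketch below fits a hypothetical single-line model C(w) = a*w + b, where the a*w term plays the role of a parallel-plate contribution and b a width-independent fringing contribution. Both the model form and the numbers are placeholders generated for illustration; they are not simulation results and not the form used by the program described above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Placeholder data: line widths and capacitances in arbitrary units,
# generated from known coefficients so the fit can be checked.
widths = [1.0, 2.0, 3.0, 4.0]
caps = [0.5 * w + 0.8 for w in widths]

slope, intercept = fit_line(widths, caps)
```

A nonzero intercept in such a fit is what a pure parallel-plate estimate would miss, which is the effect the section attributes to fringing and sidewall capacitance.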


1.6.  Analog  Testing 

In this project we aim to reduce production testing time by designing efficient test sets. As
is done in digital testing, the test sets we design are intended to detect the faults that are
likely to happen in practice. Therefore, we base our test set design on a fault model developed
for estimating yield in fabrication, where faults are characterized as either catastrophic or
parametric. It has been shown in previous work that catastrophic faults are relatively easy to
detect in analog circuits. Typically a few ac and dc tests are required. Consequently, most of
production testing time is spent attempting to detect parametric faults by verifying all of a
circuit's specifications. As a result, any major reduction in testing time should come from
reducing the number of specification tests that need to be performed. In addition, since testing
is terminated upon the first failure of a test, an optimal ordering of tests can further reduce
testing time.

To identify unnecessary specification tests and optimally order the remaining tests, we
begin with a statistical description of parameter fluctuations in fabrication. We add
specification tests to the set of optimally ordered tests one by one by looking at fault
coverages of various subsets of tests. As tests are added, the current fault coverage of the set
is determined, and the test set is complete when sufficient fault coverage has been reached.
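The incremental construction described above can be sketched as a greedy loop over a table of sampled outcomes. The greedy selection rule, the boolean table layout, and the names below are assumptions made for illustration; the text does not specify how the next test is chosen.

```python
def order_tests(fails, target_coverage):
    """Greedily order tests by how many still-undetected faulty circuits they catch.

    fails[c][t] is True when sampled faulty circuit c fails test t.
    Returns (ordered test indices, fault coverage after each added test).
    """
    n_circuits = len(fails)
    undetected = set(range(n_circuits))
    remaining = set(range(len(fails[0])))
    ordered, coverage = [], []
    while remaining and (n_circuits - len(undetected)) / n_circuits < target_coverage:
        gain = {t: sum(1 for c in undetected if fails[c][t]) for t in remaining}
        best = max(sorted(remaining), key=lambda t: gain[t])
        if gain[best] == 0:
            break  # no remaining test detects anything new
        ordered.append(best)
        remaining.remove(best)
        undetected = {c for c in undetected if not fails[c][best]}
        coverage.append((n_circuits - len(undetected)) / n_circuits)
    return ordered, coverage

# Hand-made example: 4 faulty circuits, 3 tests (test 2 never fails).
example = [
    [True, False, False],
    [True, False, False],
    [True, True, False],
    [False, True, False],
]
tests, cov = order_tests(example, 1.0)
```

In the example, test 0 is chosen first (it catches three of the four faulty circuits), test 1 completes the coverage, and test 2 is never needed.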

In the last six months an algorithm has been implemented which optimally orders necessary
specification tests and eliminates unnecessary specification tests. It has been tested on some
examples, and preliminary results show that we are able to attain approximately a 10 times
speed-up over crude Monte Carlo methods for the same accuracy. The example below is an op
amp which has 15 specifications and a process model containing 13 parameters. As can be seen
in this example, very few specifications are critical. Future work needs to be done on improving
the reliability of the algorithm, and more examples need to be tested.


SPECIFICATION                                     FAULT COVERAGE OF       TEST YIELD
(in order of importance)                          CURRENT TEST SET

phase margin > 60 degrees                         0.990224                0.739471
settling time (1V step, 0.1% interval) < 500ns    0.999894                0.9929
minimum output swing < -1.2 V                     0.999939                0.99996
gain > 10000                                      0.999998                0.999907
maximum output swing > 1.2 V                      1.0                     0.999968
unity gain bandwidth > 4 MHz                      1.0                     1.0
maximum common mode input voltage > 1.0 V         1.0                     1.0
CMRR > 80 dB                                      1.0                     1.0
power dissipation < 1.1 mW                        1.0                     1.0
PSRR @ dc > 70 dB                                 1.0                     1.0
PSRR @ 1 kHz > 40 dB                              1.0                     1.0
minimum common mode input voltage < -1.0 V        1.0                     1.0
slew rate > 2.5 V/µs                              1.0                     1.0
maximum systematic offset < 0.001 V               1.0                     1.0
minimum systematic offset > -0.001 V              1.0                     1.0

*  Cadence  Design  Systems 


1.7.  Constraint  Generation  for  Routing  Analog  Circuits

The analog nature of real-world signals makes it impossible to eliminate analog circuits
completely, even when the host system is implemented in the digital domain. As the trend toward
integrating both digital and analog functions continues, CAD tools become an essential part of
the design methodology. At this time, while CAD tools for digital design are well developed,
tools for analog design are still inadequate for improving design productivity. Also, at very
high frequencies of operation, since digital technology has not matured, analog processing of
signals is widely used. Although simulation and optimization tools are used by analog designers
in the electrical design phase, general-purpose CAD tools are not yet used for layout design.
Since the performance of analog circuits can be very sensitive to stray effects, manual layout
is time consuming and error-prone.

The layout of analog circuits is usually developed by designers in an iterative fashion. In
each pass, after the layout is designed, all the layout parasitics are extracted and the circuit
is simulated to check whether performance specifications are met. If they are not met, possible
trouble spots in the layout are "guessed," some changes are made to the layout, and the process
is repeated. This is a highly inefficient approach, which can give rise to a large number of
time-consuming iterations.

We follow a novel constraint-based approach [8] for automating analog layout, both placement
and routing, although presently we are concentrating on routing. Analog routing is considered
as much more than a path-connection problem. In the first phase, critical layout parasitics
are detected using sensitivity analysis [9]. Then a set of constraints is generated on these
critical parasitics to ensure that circuit performance degradation remains within some specified
limit. These constraints can then be used to drive the autorouter. This results in a high
probability that performance specifications will be met in the very first pass of routing. The
parasitic constraints are of two types: bounding constraints and matching constraints. Since
there can be many possible combinations of constraints which meet performance specifications, a
unique algorithm is used to generate a set of constraints which maximizes the flexibility of the
router.
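To make the idea of bounding constraints concrete, the sketch below derives capacitance bounds from sensitivities using a first-order model, in which the performance degradation is approximated by the sum of |S_i| * C_i. It splits the allowed degradation equally among the critical parasitics. This equal-budget rule is a deliberately naive stand-in chosen for illustration; it is not the flexibility-maximizing algorithm referred to above, and all names are hypothetical.

```python
def equal_budget_bounds(sensitivities, max_degradation):
    """Upper bounds on parasitic capacitances from first-order sensitivities.

    sensitivities maps a parasitic name to d(performance)/d(capacitance).
    Parasitics with zero sensitivity are not critical and get no bound.
    """
    critical = {name: abs(s) for name, s in sensitivities.items() if s != 0.0}
    share = max_degradation / len(critical)  # equal slice of the budget each
    return {name: share / s for name, s in critical.items()}

# Hypothetical sensitivities for two net-to-ground and one net-to-net capacitance.
sens = {"c_ab": 2.0, "c_a_gnd": -0.5, "c_b_gnd": 0.0}
bounds = equal_budget_bounds(sens, 1.0)
worst_case = sum(abs(sens[name]) * bound for name, bound in bounds.items())
```

If every bounded capacitance sits exactly at its limit, the first-order degradation equals the allowed budget, so any routing that respects the bounds stays within the specification under this linear model.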

The program PARCAR (a parasitic constraint generator for analog routing), which we have
developed, generates constraints on capacitances (both net-to-net and net-to-ground) for a given
set of performance constraints. Interface programs have been developed for the simulators SPICE3
and SWAP to automatically obtain sensitivities with respect to all routing capacitances
which may exist in the layout. These sensitivities are fed to PARCAR for generating
constraints. PARCAR is being interfaced to the area router ROAD and the channel router
ROADRUNNER.


1.8.  Logic  Synthesis  for  Programmable  Gate  Arrays  (Alberto  Sangiovanni-Vincentelli)

Programmable  gate  arrays  (PGA’s)  are  becoming  increasingly  important  architectures  for 
rapid  system  prototyping.  One  common  feature  of  such  architectures  is  the  presence  of  a  repeated 
array  of  identical  logic  blocks.  A  logic  block  is  a  versatile  configuration  of  logic  elements  which 
can  be  programmed  by  the  user. 

With the growing complexity of the logic circuits that can be packed on a PGA, it becomes
necessary to have automatic tools that map logic functions onto these architectures. A simple
application of the logic synthesis techniques used for semi-custom, cell-based architectures
generally does not yield a satisfactory result and may be totally inadequate for such
architectures. We are investigating the problem of combinational logic synthesis for two
interesting and popular programmable gate array architectures: a RAM-based architecture (Xilinx)
and a multiplexor-based one (Actel). We address the problem of synthesizing a set of Boolean
equations using these architectures such that the minimum number of logic blocks is used.

In the Xilinx architecture, a basic block can implement any logic function of up to five
variables. The algorithm has two main steps: decomposition and covering. Decomposition is applied
to those functions that have more than five inputs. Various decomposition techniques are used
to obtain a network in which all nodes have at most five fanins; for example, classical decomposition
techniques like Karp-Roth decomposition are investigated and compared with kernel extraction based
on the support of the functions. After decomposition, the new network may contain groups of functions
that can be realized together by one basic block, so covering methods are used to reduce the number of
basic blocks needed for the network. We use an exact covering formulation and also some heuristics
to solve this otherwise intractable problem.
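
The covering step can be illustrated with a small sketch. It is not the exact covering formulation mentioned above; it is a simple greedy heuristic, written under the assumption that the decomposed network is acyclic and is given as a map from each node to its set of fanins. The names `greedy_collapse`, `fanins` and `outputs` are hypothetical.

```python
K = 5  # inputs per basic block, as stated for the Xilinx architecture


def greedy_collapse(fanins, outputs, k=K):
    """Greedily merge single-fanout nodes into their fanout node whenever
    the merged support still fits into one k-input block.

    fanins:  dict mapping each internal node to the set of its fanin names
             (names that are not keys are primary inputs)
    outputs: set of nodes that must stay visible (primary outputs)
    Returns a new fanin dict; every remaining key costs one basic block.
    """
    net = {node: set(support) for node, support in fanins.items()}
    changed = True
    while changed:
        changed = False
        for node in sorted(net):
            if node in outputs:
                continue
            users = [m for m in net if node in net[m]]
            if len(users) != 1:
                continue  # shared or dangling nodes are left alone
            user = users[0]
            merged = (net[user] - {node}) | net[node]
            if len(merged) <= k:
                net[user] = merged   # one block now realizes both functions
                del net[node]
                changed = True
                break                # restart the scan on the smaller network
    return net


small = {"g1": {"a", "b"}, "g2": {"c", "d"}, "f": {"g1", "g2", "e"}}
print(len(greedy_collapse(small, {"f"})))   # number of basic blocks used
```

In this example all three nodes fit into a single five-input block. If `g2` had a third input, the merged support would have six variables and two blocks would remain.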

In the Actel architecture, the basic block is a configuration of 2-to-1 multiplexors. Our
method of synthesis is similar to the one employed in MISII. However, guided by the architecture,
we choose a different representation of the subject-graph and pattern-graphs. We capture the
gates in the library by a very small pattern-set. After covering the subject-graphs by pattern-graphs,
we use an iterative-improvement phase that uses collapse and decompose operations to
improve the result.
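
As background on why multiplexor-based blocks are a natural target, the sketch below builds a tree of 2-to-1 multiplexors for an arbitrary Boolean function by Shannon expansion. This is a textbook construction used only for illustration; it is not the subject-graph and pattern-graph covering described above, and all function names are hypothetical.

```python
from itertools import product


def mux(sel, d0, d1):
    """Behaviour of a 2-to-1 multiplexor: d0 when sel is 0, d1 when sel is 1."""
    return d1 if sel else d0


def build_mux_tree(f, names):
    """Shannon expansion f = x'.f(x=0) + x.f(x=1), one mux per expanded variable.

    Returns 0, 1, or a nested tuple ('mux', variable, low_branch, high_branch).
    """
    def expand(assign, remaining):
        if not remaining:
            return int(bool(f(**assign)))
        var, rest = remaining[0], remaining[1:]
        low = expand({**assign, var: 0}, rest)
        high = expand({**assign, var: 1}, rest)
        if low == high:
            return low               # the variable is irrelevant on this branch
        return ("mux", var, low, high)
    return expand({}, list(names))


def evaluate(tree, values):
    """Evaluate a mux tree for a dict of 0/1 input values."""
    while isinstance(tree, tuple):
        _, var, low, high = tree
        tree = mux(values[var], low, high)
    return tree


def count_muxes(tree):
    """Number of multiplexors in the tree (identical subtrees are not shared)."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_muxes(tree[2]) + count_muxes(tree[3])


def equivalent(f, tree, names):
    """Exhaustively compare the tree with the original function."""
    return all(
        evaluate(tree, dict(zip(names, bits))) == int(bool(f(*bits)))
        for bits in product((0, 1), repeat=len(names))
    )
```

A tree built this way is rarely minimal, which is one reason a covering step followed by iterative improvement is attractive.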

The results for both architectures, as given in the following tables, are encouraging. Our
results are given under the column mis-pga. For the Actel architecture, we compare our results with
MISII results, whereas for Xilinx, we compare with results obtained from industry. Area is in
terms of the number of basic blocks.


XILINX

example      no. nodes      mis-pga            ind. results
              (init.)     Area     time            Area

10bitreg         20        10        6.9            11
10count         126        23      154.2            19
180degc         146        21       74.5            34
3to8dmux        101        30       81.1            25
4-16dec          70        12       57.5            18
4cnt             66        17       35.0            10
8bappreg        141        27       64.9            27
8count          138        20       57.7            29
9bcasc2         135        34     1028.9            45
99bcasc2        133        29       91.9            42
arbiter         146        21       73.3            13


ACTEL

example          MISII                 mis-pga
             Area     time         Area        time

f51m           52      21.5          50        59.7
bw             80      46.7          65        82.0
rot           310     108          3292       497.4
5xp1           52      19.5          46        43.2
c499          174      81.7         166        61.0
C1908         189     104.99        185   3258888.6
C5315         732     353           666      1865.3

It should be mentioned that the algorithms described here are general: for example, the
algorithms for the Xilinx architecture would work for any architecture in which the basic block
implements a function of n variables, n >= 2.

As future work, we plan to look into minimizing the delay through the critical path. Extension
to sequential circuits is also planned. It is hoped that such a tool will enable evaluation of
different PGA architectures from a synthesis point of view and result in the design of better
architectures.


1.9.  Development  of  the  SIS  (Sequential  Interactive  Synthesis  System)  (R.K.  Brayton) 

Several  critical  issues  regarding  the  internal  data  structures  and  the  interchange  format 
needed to be resolved. In particular, an extended version of BLIF (Berkeley Logic Interchange
Format) was designed and is supported by SIS (work done by S. Malik, Tzvi Ben-Tzur and K. J.
Singh). This extended format permits specification of sequential elements (latches), timing
information (clocking schemes) and state transition diagrams along with gates to represent digital
logic. A reader and a writer for this format were written and incorporated in SIS.

Area optimization using retiming and resynthesis ideas (developed earlier last year) was
implemented as part of SIS. Surprisingly, the area improvements obtained using these techniques
were negligible. This was shown to be a possible result of limitations of existing combinational
logic optimization tools, as well as an inherent characteristic of the logical nature of some
classes of circuits. These results were presented along with their analysis at HICSS 90 [10].

Performance optimization using retiming and resynthesis was satisfactorily tackled (work
done by S. Malik and K. J. Singh). We considered the problem of redesigning a given pipelined
circuit to meet a required cycle time. We demonstrated how the pipelined circuit can be
transformed to a combinational circuit and showed that solving a performance optimization
problem for this combinational circuit is both necessary and sufficient for solving the performance
optimization problem for the pipelined circuit. This is significant in two ways. First, it shows
that all known (as well as yet to be developed) ideas in combinational performance optimization can
be used in pipeline performance optimization. Second, it shows that these are enough, i.e., we
need not consider special techniques for pipelined circuits.


1.10.  Retiming  and  Initialization  of  Finite  State  Machines  (R.K.  Brayton) 

Retiming  is  a  general  and  powerful  technique  to  perform  delay  or  area  optimization  of 
sequential  circuits.  When  sequential  circuits  are  specified  with  an  initial  state,  it  is  necessary  to 
maintain  the  initial  state  across  retimings.  We  have  developed  and  implemented  in  SIS  a  simple 
and elegant method to perform this computation. If the initial state of the circuit is contained in a
cycle of state transitions, our algorithm does not require any modification of the logic of the
circuit, unlike previous approaches. For finite state machines, it is a simple matter during
state assignment to make sure that the initial state is contained in a cycle of transitions if this is
not already the case. This can be achieved at little or no cost in final circuit area. Once this
condition is satisfied, initial states can be maintained across retimings with no need for costly state
transition graph extractions, backtracking searches, or logic duplication [11].
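
For readers unfamiliar with retiming itself, the following sketch shows the standard graph formulation (due to Leiserson and Saxe), in which each edge carries a register count and each vertex is assigned an integer lag. It illustrates only how latches move; it does not implement the initial-state computation described above, and the function and variable names are hypothetical.

```python
def retime(edges, lag):
    """Apply a retiming to a circuit graph.

    edges: dict mapping (u, v) to the number of registers on the edge u -> v
    lag:   dict mapping a vertex to its integer lag r(v); missing vertices get 0
    The retimed count on u -> v is w(u, v) + r(v) - r(u); it must stay >= 0.
    """
    retimed = {}
    for (u, v), weight in edges.items():
        new_weight = weight + lag.get(v, 0) - lag.get(u, 0)
        if new_weight < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} would get {new_weight} registers")
        retimed[(u, v)] = new_weight
    return retimed


# A three-gate loop with two registers.
loop = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}

# Lag -1 on b moves the register on b's input to b's output.
moved = retime(loop, {"b": -1})
print(moved)
```

The number of registers around any cycle is unchanged by a retiming, which is why the register on the edge into `b` reappears on the edge out of `b`.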


1.11.  Technology  Mapping  for  Area  and  Delay  (R.K.  Brayton) 

Technology  mapping  is  the  process  of  mapping  a  logic  network  onto  a  set  of  predefined  and 
characterized  library  gates.  Most  of  the  early  work  in  this  area  has  focused  on  minimizing  circuit 
area,  and  only  a  small  set  of  optimization  techniques  have  been  available  for  minimizing  circuit 
delay. We have developed and implemented several new algorithms to perform delay minimization
at a moderate cost in area, and we have obtained encouraging results (average speedups of
43% for a 14% increase in circuit area). These algorithms have been incorporated in the latest
release of MISII (version 2.2).

The basis of our approach is to decompose circuits into networks of trees and to use efficient
algorithms to implement each tree separately. A similar decomposition was used in previous
work (e.g. [12]). The main originality of our work is to incorporate within the same framework
two kinds of trees: fanin trees, which implement the logic of the circuit, and fanout trees, which
distribute the output signals of fanin trees to their destinations. In the present implementation,
fanin tree optimization and fanout tree optimization are done independently. We are currently
experimenting with the best way to integrate these two optimizations.


1.12. Optimum and Heuristic Algorithms for Finite State Machine Decomposition and
Partitioning (A.R. Newton)

Techniques have been proposed in the past [15,16,17,18] for various types of finite state
machine (FSM) decomposition that use the number of states or edges in the decomposed circuits
as the cost function to be optimized. These measures do not reflect the true logic
complexity of the decomposed circuits. These methods have been mainly heuristic in nature and
offer limited guarantees as to the quality of the decomposition. In this work [14], optimum and
heuristic algorithms for the general decomposition of FSMs have been developed such that the
sum total of the number of product terms in the one-hot coded and logic minimized submachines
is minimum or minimal. This cost function is much more reflective of the area of an optimally
state-assigned and minimized submachine than the number of states/edges in the submachine.
The problem of optimum two-way FSM decomposition has been formulated as one of symbolic-output
partitioning and has been shown to be an easier problem than optimum state assignment.
A procedure of constrained prime-implicant generation and covering has been described that
represents an optimum FSM decomposition algorithm under the specified cost function. Exact
procedures are not viable for large problem instances. A novel iterative optimization strategy of
symbolic-implicant  expansion  and  reduction,  modified  from  two-level  Boolean  minimizers,  that 
represents  a  heuristic  algorithm  based  on  the  exact  procedure  has  also  been  developed.  Reduction 
and  expansion  are  performed  on  functions  with  symbolic,  rather  than  binary-valued  outputs.  The 
heuristic  procedure  can  be  used  for  problems  of  any  size.  Preliminary  experimental  results  have 
been  presented  that  illustrate  both  the  efficacy  of  the  proposed  algorithms  and  the  validity  of  the 
selected  cost  function. 


1.13.  A  Unified  Approach  to  the  Decomposition  and  Re-decomposition  of  FSMs  (A.R. 
Newton) 

A unified framework and associated algorithms for the optimal decomposition and
re-decomposition of sequential machines have been developed. This framework allows for a uniform
treatment of parallel, cascade and general decomposition topologies, operating at the State
Transition Graph (STG) level, while targeting a cost function that is close to the eventual logic
implementation.

Previous work in decomposition has targeted specific decomposition topologies [20,21,23].
In this paper [19], the decomposition problem has been formulated as one of implicant covering
with associated constraints. An optimum covering corresponds to an optimum decomposition. It
has been shown in this work that two-way or multi-way, parallel, cascade or general decomposition
topologies can be targeted simply by changing the constraints in the covering step. The
relationship of this work to preserved partitions and covers, traditionally used in parallel and cascade
decomposition, has been indicated.

In many cases, an initial decomposition is specified as a starting point. Attempting to
flatten a set of interacting circuits into a single lumped STG could require astronomical amounts
of CPU time and memory. Memory and CPU time efficient re-decomposition algorithms that
operate on distributed-style specifications and which are more global in nature than those
presented in the past have been developed [22]. Arbitrary, interacting sequential circuits can be
optimized for area and performance by iteratively applying re-decomposition algorithms across
latch boundaries.


1.14. Testability Driven Decomposition of Large Finite State Machines (A.R. Newton)

The synthesis of controllers in the form of interacting finite state machines (FSMs) can
result in improved performance and smaller area. In this work, we address sequential testability
aspects in the decomposition of FSMs.

Testability-driven decomposition techniques have been presented that realize
controllers as FSM networks consisting of mutually interacting submachines which are fully
testable for all single stuck-at faults without requiring direct access to the memory elements.
Constrained decomposition techniques for the synthesis of fully and easily testable decomposed
FSM networks have been described. Subsequently, an exhaustive classification of redundant
faults that can occur in a single FSM embedded in a network has been presented. Associating
each class of these redundant faults with a don't care set, an optimal decomposition technique has
been described that synthesizes controllers as irredundant FSM networks with no area overhead
by exploiting these don't cares optimally.

An FSM network can be flattened into a single State Graph and made fully testable using
previously proposed synthesis techniques [24,25]. However, when dealing with a large, lumped
FSM, the don't care set under which the synthesis has to be optimal could be huge. Also, if the
given initial representation is distributed, flattening it into a single lumped STG can require
astronomical amounts of memory and CPU time.

It has been shown in this work that when a circuit is being implemented as an interconnection
of FSMs, the required don't care set in an optimal synthesis procedure can be heavily pruned.
Partitioning of logic serves to filter out the part of the don't care set which is not useful. The new
synthesis procedure operates on a distributed-style representation of interacting STGs, which is
considerably more compact than a lumped representation, carrying out a series of local analyses.
This decomposition-based optimal synthesis technique is significantly more efficient, both in
terms of memory and CPU time usage, than previously proposed optimal synthesis techniques. It
is, therefore, viable for circuits of greater size.


1.15. Cache Management Techniques for Multiprocessors, with an Emphasis on VLSI
CAD Applications (A.R. Newton)

General  purpose  multiple  instruction  stream,  multiple  data  stream  (MIMD)  multiprocessors 
offer  an  attractive  means  of  accelerating  many  computationally  intensive  tasks  that  arise  in  VLSI 
CAD  systems;  examples  of  such  tasks  include  circuit  simulation,  logic  simulation,  placement, 
routing,  design  rule  checking  and  fault  simulation.  MIMD  multiprocessors  are  attractive  because 
they are a general purpose solution that can also be exploited as high performance multiprogrammed
computers. This is in contrast to special purpose CAD accelerators that only support
single tasks such as logic simulation.

The particular class of shared-bus, shared-memory multiprocessors has gained commercial
acceptance with offerings by Sequent, Encore and others. These machines offer high performance
at reduced cost for applications with modest amounts of exploitable parallelism. By
efficiently supporting the shared memory programming paradigm at the hardware level, these
machines are usually much easier to program than those supporting message passing because the
programmer does not have to worry about distributing data across multiple memories. Shared
memory is especially attractive for programming CAD applications because it is often unclear
how to efficiently map the problem at hand into a set of message passing processes.

The success of shared-bus multiprocessors is largely due to the efficient caching techniques
made possible by the shared bus [26]. Effective caching techniques are important in multiprocessor
design because the provision of a cache at each processor masks the often severe
access delay to main memory through an interconnection network. Unfortunately, computers
with two or more caches require techniques for ensuring that the caches remain consistent: all
changes to a piece of data must be reflected in all cached copies. It is now fairly well understood
how to enforce cache consistency in shared-bus multiprocessors by exploiting their broadcast
capabilities [27]. Unfortunately, shared-bus architectures support relatively few processors (probably
50 or fewer), and it is unclear how to enforce consistency in more scalable, non-bus architectures;
this is the problem that we address in this research.

Cache consistency methods proposed to date may be divided into four classes: those in
which shared writeable data is uncached, "snooping" protocols (for shared-bus systems), directory
schemes, and software assisted techniques [28]; the first three classes are categorized as hardware
based techniques. Hardware techniques enforce consistency in a manner that requires no special
cache control instructions in program object code. Software techniques enforce consistency by
inserting cache control instructions in object code at compile time. Since compilers generally
have more information about the future behavior of a program than does the hardware at run time,
software methods offer potentially better performance in terms of reduced average access time
and network traffic. Several large, non-bus shared memory multiprocessor designs (the New
York University Ultracomputer [29], the IBM RP3 prototype [30], and the University of Illinois Cedar
machine [31]) have dealt with the consistency problem using software methods. Only recently
has work been reported on the investigation of hardware coherence methods for large non-bus
machines [32,33,34,35]. In most of these cases the methods proposed deal with specific architectures
based on collections of buses connected in meshes, cubes or trees. In addition, few results
have been reported characterizing the behaviour of parallel programs running on large numbers of
processors; most machine designs have been based on analytic models.

This project is an investigation of the performance of cache consistency schemes suitable
for large scale shared memory multiprocessors. It began with a comparison of three schemes
using execution driven simulation of three parallel CAD programs. The first coherence scheme
was simple: shared-writeable data is not cached. This scheme provided a reference point for the
two others. The second scheme was a sectored variation of that published by Censier and Feautrier
[36], in which tags are associated with each block of main memory to record all caches having
copies; whenever a processor writes to shared-writeable data, all other copies are invalidated
using the data stored in the tags. While this method permits shared-writeable data to be cached,
the additional memory required for tags and the additional network traffic required for invalidations
may be excessive. The third scheme was novel in that it permitted the assignment of data to
physical main memory locations to change dynamically. It required main memory to be distributed
among the processors and handled it as if the distributed pieces were caches themselves.
Since main memory is usually an order of magnitude larger than the aggregate cache memory,
network traffic may be reduced if local memory hit rates improve and consistency traffic is low.
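
The bookkeeping behind the second scheme can be sketched as follows. This is a deliberately simplified model, assuming one tag (a set of cache identifiers) per memory block and ignoring data values, write-back and sectoring; the class and method names are hypothetical and do not come from the simulator used in the study.

```python
from collections import defaultdict


class Directory:
    """Per-block tags recording which caches currently hold a copy."""

    def __init__(self):
        self.sharers = defaultdict(set)  # block address -> set of cache ids
        self.invalidations = 0           # invalidation messages sent so far

    def read(self, cache_id, block):
        """A cache fetches a copy of the block; the tag records the new holder."""
        self.sharers[block].add(cache_id)

    def write(self, cache_id, block):
        """A cache writes the block; every other copy is invalidated.

        Returns the set of caches that received an invalidation.
        """
        others = self.sharers[block] - {cache_id}
        self.invalidations += len(others)
        self.sharers[block] = {cache_id}
        return others


directory = Directory()
for cache in (0, 1, 2):
    directory.read(cache, block=0x40)
print(directory.write(1, block=0x40))   # caches that must drop their copy
```

The two costs named above are both visible here: `sharers` grows with the number of memory blocks, and each write to a widely shared block produces one invalidation per other holder.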


The comparison demonstrated three principal results. First, for the benchmarks considered,
the number of references to shared-writeable data is sufficiently high to justify the caching of
shared-writeable data; significant reductions in average memory access time and network traffic
were obtained for the second and third coherence schemes relative to the first. Second, the
component of network traffic due to sharing and synchronization dominated the component due to
normal cache misses. This showed that the reduction in normal miss traffic permitted by the
larger effective cache size of the third scheme had little impact on improving performance.
Third, the simulations showed that although the sectored version of Censier and Feautrier's
scheme provided substantial performance improvement over the first scheme, the large block
sizes required to minimize tag overhead introduced an excessive amount of false sharing.

The initial evaluation has been extended by considering the use of tag caches as a way to
reduce tag overhead while minimizing false sharing. In a tag caching scheme, tags are not associated
with each block of main memory but are stored in an associative memory indexed by block
address. This permits a much smaller number of tags at the expense of a more complex main
memory controller. Since the number of tags is much smaller than the number of main memory
blocks in such a scheme, it is sometimes necessary to prematurely displace shared data from caches to get a
free tag. Preliminary results [37] on the effect of various tag cache sizes on network traffic and
average access time show that Censier and Feautrier's directory method performs as well with a
tag cache as with a full directory. Preliminary implementation considerations suggest that a tag
cache can be implemented with the same low cost dynamic RAM as data memory.

This work is currently being extended by evaluating more realistic network models, including
direct and indirect binary cubes, and cubes and trees of busses. Future work will also address
the effectiveness of combining as a technique for reducing the significant performance degradation
caused by synchronization. An initial evaluation of alternatives will use approximate queueing
models, and will require the development of techniques that model synchronization. The
most promising alternatives will then be evaluated using detailed simulations. This work will
also be extended by acquiring more benchmark programs and by considering larger numbers of
processors.


1.16.  SLIP:  System  Level  Interactive  Partitioning  (A.R.  Newton) 

Large electronic systems are usually constructed as a hierarchy of physical components,
such as backplanes, boards, chips, and transistors. This hierarchy is typically defined as part of the
system design process, for example as functionality is assigned to each chip. Alternatively, an
existing design might be reimplemented as a different hierarchy to take advantage of changes in
VLSI technologies. For example, a TTL system could be reimplemented on a number of gate-array
chips.

This problem poses conflicting demands upon algorithms which can be used to automatically
generate system hierarchies. The problem size makes the speed and memory efficiency of
the algorithms critical, but the design must be modeled and optimized with enough detail to
ensure that any constraints which are imposed by the packaging or system designer are satisfied.

We feel that this problem cannot be solved efficiently with a single algorithm. Rather, a
sequence of algorithms which are optimized to perform specialized tasks must be used. The initial
algorithms in the sequence partition and cluster the components of the system into a manageable
number of tightly connected components. This will reduce the size of the problem so that a
more careful algorithm can be used to assign clusters to packages and ensure that no package or
user constraints are violated.

SLIP [38] is a framework for the development of algorithms that generate and improve system
hierarchies. SLIP supports a representation of the hierarchy in the OCT [39] data model and
provides a library of routines to maintain this representation as changes are made to the hierarchy.
These routines simplify the implementation of partitioning algorithms and ensure the consistency
of the representation as a sequence of algorithms is applied to the problem.

Status and Future Work

Recent  work  on  SLIP  has  focused  upon  the  development  of  an  improved  user-interface  and 
a  resource  model  which  models  the  capacities  and  limitations  of  the  package  technologies.  The 
new user interface is built as an RPC application to VEM, the Octtools graphical editor, and
is a significant improvement over the old user interface. We hope that this interface will allow
others  to  evaluate  SLIP  and  to  give  us  feedback  and  access  to  examples. 

The  resource  model  provides  a  representation  of  packaging  technologies  which  is  simple 
enough  to  be  used  as  a  basis  for  optimization,  yet  sufficiently  general  to  model  most  real-world 
technologies.  This  model  relies  upon  user-defined  callback  functions  to  evaluate  package 
resources,  and  so  should  enable  SLIP  to  be  easily  configured  to  new  technologies. 

The  user  interface  and  the  package  resource  model  are  near  completion,  although  more 
work  is  expected  as  users  adopt  SLIP  and  more  realistic  packaging  technologies  are  modeled.  In 
the  near  future,  we  hope  to  model  an  industrial  ASIC  technology  and  to  test  SLIP  by  partitioning 
a  set  of  examples  into  this  technology. 

Our long-term goals are to develop a timing model in SLIP and to develop delay-based partitioning
methods which will minimize the effect of partitioning upon the system's performance.


1.17.  Applying  Synthesis  Techniques  to  Aid  Simulation  (A.R.  Newton) 

Simulation  is  an  important  tool  in  many  areas  of  circuit  design.  This  project  focuses  on  the 
simulation  of  a  design  specification,  where  the  goal  is  to  determine  if  the  specification  fits  the 
need  for  which  it  was  developed  (as  opposed  to  simulation  whose  goal  is  to  determine  if  a  design 
matches  its  specification). 

In common with all simulators, those operating on specifications are generally slower than
one would like. One technique that has been used to speed up all types of simulators is to produce
program fragments to evaluate functions that need to be simulated, and compile these fragments
into machine code for some target computer. Although such a "compiled-code" simulator
gains some speed benefit, it generally simulates the specification as written, which was not optimized
to make the best use of the characteristics of the target machine. In fact, since it is a
specification, it is only optimized for clarity. This project seeks improvements in simulation
speed by treating simulation as a circuit synthesis problem, where the objective is to create a
"circuit" that will simulate at high speed on the target machine, rather than to create a circuit
that takes up a small area on a chip. Although these goals are frequently similar, they
are not necessarily so. If the target machine is a massively parallel single-instruction/multiple-data
machine, for example, a significant advantage may be gained by mapping the design onto
many instances of a single primitive function, so that all the processors can be doing the same
thing all the time. On a conventional machine, it may be important to structure the "circuit" so
that its width is limited, allowing the computer to hold the temporary variables that represent
internal signals in its high-speed registers.

To date, a program has been written to create a compiled-code simulation of combinational
logic blocks using three-value logic simulation. Fairly conventional logic optimization techniques
have been applied, and seem generally to improve the simulation speed by a factor of 2-5
for the larger examples studied. Current work involves partitioning the initial design and optimizing
the parts: this seems to reduce the simulation speed by about 20%, while decreasing
the time spent in optimization by a factor varying from 1 to 2. It has the important benefit of
allowing the optimization to complete on more examples without running out of memory. In the
coming months, we plan to port the simulation to a massively parallel machine, and investigate
more machine-specific optimizations.
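
The idea of compiled-code, three-valued simulation can be sketched in a few lines. The example below generates source code for a small netlist and compiles it into a function, rather than interpreting the netlist gate by gate. It is an illustration only: the encoding of the unknown value, the netlist format and all names are assumptions, and Python's `exec` stands in for compilation to machine code.

```python
X = 2  # assumed encoding of the unknown value; 0 and 1 keep their usual meaning


def and3(a, b):
    if a == 0 or b == 0:
        return 0          # a controlling 0 decides the output even if the other input is X
    if a == 1 and b == 1:
        return 1
    return X


def or3(a, b):
    if a == 1 or b == 1:
        return 1          # a controlling 1 decides the output
    if a == 0 and b == 0:
        return 0
    return X


def not3(a):
    return X if a == X else 1 - a


def compile_netlist(inputs, gates, outputs):
    """Turn a netlist into a straight-line Python function.

    gates: list of (signal, operator name, operand names) in topological order,
           where the operator name is one of 'and3', 'or3', 'not3'
    """
    lines = [f"def simulate({', '.join(inputs)}):"]
    for signal, op, operands in gates:
        lines.append(f"    {signal} = {op}({', '.join(operands)})")
    lines.append(f"    return ({', '.join(outputs)},)")
    namespace = {"and3": and3, "or3": or3, "not3": not3}
    exec("\n".join(lines), namespace)   # compile the generated fragment
    return namespace["simulate"]


simulate = compile_netlist(
    inputs=["a", "b", "c"],
    gates=[("n1", "and3", ["a", "b"]), ("n2", "not3", ["c"]), ("y", "or3", ["n1", "n2"])],
    outputs=["y"],
)
print(simulate(1, X, 1))
```

Each internal signal becomes a local variable of the generated function, which is the software analogue of holding temporaries in registers.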


1.18.  A  Generalized  Approach  to  the  Constrained  Cubical  Embedding  Problem  (A.R. 
Newton) 

In  this  research,  an  efficient  generalized  approach  to  the  constrained  cubical  embedding 
problem  is  proposed.  Optimal  cubical  embedding  is  tightly  related  to  various  encoding  and  state 
assignment problems in high-level synthesis. Cubical embedding aims to embed
symbolic values onto the vertices of a Boolean hypercube subject to the satisfaction of
constraints and some objective criterion. This is a known difficult combinatorial optimization
problem whose complexity has been shown to be NP-complete [40].

Previous constrained cubical embedding algorithms were formulated to solve specialized
constraints. These constraints result from different symbolic minimization procedures and problem
formulations. For example, in [41], a symbolic minimization procedure for two-level state
assignment that considers the input fields in the state transition table was proposed. Two-level
multiple valued minimization was used to obtain a reduced symbolic cover along with a set of
embedding constraints. These constraints, called "face embedding constraints," are to be satisfied
in the embedding process to obtain a compatible Boolean cover.
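
A face embedding constraint can be checked mechanically. The sketch below uses the usual reading of such a constraint: the codes of a group of symbols must span a face (subcube) of the hypercube that contains no code of a symbol outside the group. Codes are represented as integers, and the function names are hypothetical.

```python
def face_satisfied(codes, group):
    """Check one face embedding constraint.

    codes: dict mapping every symbol to its integer code (hypercube vertex)
    group: non-empty set of symbols that must share a private face
    """
    members = [codes[s] for s in group]
    all_ones = any_ones = members[0]
    for code in members[1:]:
        all_ones &= code
        any_ones |= code
    free_bits = all_ones ^ any_ones      # positions where the group disagrees
    reference = members[0]
    for symbol, code in codes.items():
        if symbol in group:
            continue
        # The outsider lies on the face if it matches the group on every fixed bit.
        if ((code ^ reference) & ~free_bits) == 0:
            return False
    return True


def count_violations(codes, groups):
    """Number of face constraints violated by an encoding."""
    return sum(not face_satisfied(codes, g) for g in groups)


codes = {"s0": 0b00, "s1": 0b01, "s2": 0b11, "s3": 0b10}
print(face_satisfied(codes, {"s0", "s1"}))   # s0 and s1 span the face 0-
print(face_satisfied(codes, {"s0", "s2"}))   # s0 and s2 span the whole square
```

In the second call the smallest face containing `s0` and `s2` is the entire two-dimensional cube, so the other symbols fall inside it and the constraint fails.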

In  [42],  an  extended  two-level  symbolic  minimization  procedure  was  proposed  for  the 
encoding  of  symbolic  outputs.  The  corresponding  output  constraints,  called  "output  dominance 
constraints," are generated in the minimization process. Algorithms have been proposed to solve
these  output  constraints  [43],  but  combining  them  with  input  constraints  has  been  unsatisfactory. 

While  specialized  algorithms  are  effective  for  their  restricted  domains,  generalizing  them  to 
tackle a variety or combination of constraints is a formidable task since the nature of these
constraints is quite different. Therefore, a robust generalized approach would be desirable for
solving the different embedding problems. It should be easily extendible to new problem
formulations and cost objectives.

Probabilistic hill climbing has proved to be an effective approach to complex combinatorial
problems in design automation [44]. Depending on the nature of the configuration space,
probabilistic hill climbing techniques have been shown, in fact, to produce superior results to
specialized algorithmic approaches on some combinatorial problems with comparable CPU time.

In  this  work,  a  new  generalized  framework  for  the  constrained  cubical  embedding  problem 
using  probabilistic  hill  climbing  techniques  has  been  developed.  Although  probabilistic  hill 
climbing in general may not be suitable for all combinatorial optimization problems, our results
strongly  demonstrate  that  it  is  extremely  effective  for  this  particular  problem  domain. 

The  approach  has  been  implemented  in  a  package  called  CUBIC.  The  generalized  solver  is 
separated  from  the  constraints  so  that  application  specific  constraints  and  objective  criteria  can  be 
incorporated  in  a  straightforward  manner.  We  have  obtained  experimental  results  on  a  large  set 
of design examples using our generalized approach. The results obtained are comparable or
superior to those of specialized methods in similar CPU time.
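
The separation between a generic solver and pluggable constraints can be sketched as follows. This is a minimal simulated-annealing example of probabilistic hill climbing, not the CUBIC implementation: the move set (swapping the contents of two hypercube vertices), the cooling schedule, the parameter defaults, and the cost function that counts violated face constraints are all assumptions made for illustration.

```python
import math
import random


def anneal_embedding(symbols, nbits, cost, steps=5000, t0=2.0, alpha=0.999, seed=0):
    """Assign distinct nbits-wide codes to symbols, trying to minimize cost(encoding)."""
    if len(symbols) > (1 << nbits):
        raise ValueError("hypercube has too few vertices")
    rng = random.Random(seed)
    # vertices[v] holds the symbol placed on vertex v, or None if unused
    vertices = list(symbols) + [None] * ((1 << nbits) - len(symbols))
    rng.shuffle(vertices)

    def encoding():
        return {s: v for v, s in enumerate(vertices) if s is not None}

    current = cost(encoding())
    best, best_cost = encoding(), current
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(vertices)), 2)
        vertices[i], vertices[j] = vertices[j], vertices[i]
        new = cost(encoding())
        delta = new - current
        # always accept improvements; accept uphill moves with Boltzmann probability
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = new
            if current < best_cost:
                best, best_cost = encoding(), current
        else:
            vertices[i], vertices[j] = vertices[j], vertices[i]  # undo the swap
        temp = max(temp * alpha, 1e-3)
    return best, best_cost


def violated_faces(groups, nbits):
    """Build a cost function counting face constraints that are not satisfied."""
    full = (1 << nbits) - 1

    def cost(enc):
        bad = 0
        for group in groups:
            both, either = full, 0
            for s in group:
                both &= enc[s]
                either |= enc[s]
            mask = full & ~(both ^ either)   # bits fixed across the group
            if any((c & mask) == (both & mask)
                   for s, c in enc.items() if s not in group):
                bad += 1
        return bad

    return cost
```

Because the solver sees the constraints only through the cost callable, a different constraint type or objective can be substituted without touching anneal_embedding.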


1.19.  Symbolic  Encoding  of  High-Level  Descriptions  for  Multi-Level  Implementations 
(A.R. Newton)

In  this  research,  the  problem  of  symbolic  encoding  for  automated  synthesis  of  high-level 
descriptions  is  addressed.  Well  established  logic  synthesis  procedures  have  been  developed  for 
implementations  on  two-level  macros  such  as  PLAs  [49].  These  techniques  are  aimed  at  minim¬ 
izing  the  number  of  product  terms  in  the  final  output.  More  recently,  considerable  attention  has 
been  devoted  to  the  development  of  multi-level  optimization  procedures  for  multi-level  imple¬ 
mentations  such  as  standard  cells  or  CMOS  complex  gate  technologies  [45]. 

In  standard  logic  synthesis  systems,  the  input  is  a  Boolean  description  in  the  form  of  either 
a  binary  truth  table  or  a  high-level  specification  using  a  Hardware  Description  Language  (HDL). 
At  Berkeley,  the  input  specification  to  our  logic  synthesis  system  MIS  can  be  a  high-level 
description written in the BDS language. Since the description is a binary representation, the
only  data  types  allowed  are  bit-vectors  of  binary  values.  However,  in  high-level  descriptions,  the 
ability  to  represent  the  values  of  some  signals  at  a  higher  level  of  abstraction  would  be  greatly 
desirable.  As  an  example,  it  would  be  desirable  to  represent  the  internal  states  of  a  finite  state 
machine  description  with  mnemonics  (strings  of  characters)  rather  than  forcing  the  user  to  specify 
the  binary  encodings.  For  a  custom  microprocessor,  it  may  be  desirable  to  represent  the  values  of 
the instruction stream with mnemonics (e.g., ADD, COMP, BRANCH). More importantly,
the result of the logic optimization procedure is heavily dependent on the encodings of
these  symbolic  values. 

Therefore,  we  are  developing  a  Hardware  Description  Language  that  allows  the  designer  to 
specify  as  many  different  sets  of  symbolic  values  as  required.  Each  of  these  sets,  called  a  sym¬ 
bolic  type,  describes  the  admissible  values  allowed.  Symbolic  variables  can  then  be  defined 
using these symbolic types. A special symbolic type is the Boolean type, which has the admissible
values {0, 1} corresponding to binary logic descriptions. As part of a synthesis system, a
symbolic encoding procedure has been developed that attempts to optimally encode symbolic
values  into  binary  bit-vectors.  This  optimization  procedure  has  been  implemented  in  a  program 
called  JEDI  [50]. 

A special case of the symbolic encoding problem is the standard state assignment problem
in  which  only  the  state  variables  are  symbolic  assuming  one  set  of  admissible  values.  The  prob¬ 
lem  we  address  here  is  considerably  more  general.  As  such,  our  symbolic  encoding  procedure 
can  effectively  solve  the  state  assignment  problem  as  well.  Optimal  state  assignment  procedures 
have  been  developed  for  both  two-level  and  multi-level  implementations  [47][46].  The  objective 
criterion  previously  used  for  two-level  optimization  is  the  final  area  of  a  PLA  implementation 
using  the  number  of  product-terms  as  an  approximate  indicator.  For  the  multi-level  case,  a  de 
facto objective criterion is to minimize the literal count.

For  the  general  problem  of  symbolic  encoding,  an  optimization  procedure  targeted  for  two- 
level  implementations  has  been  described  in  [48].  Our  symbolic  encoding  procedure  is  targeted 
for  multi-level  implementations.  As  with  multi-level  state  assignment,  we  use  the  final  literal 
count  after  logic  minimization  as  the  optimization  criterion. 


1.20.  Exploring  Equivalent  State  Machines  and  State  Assignment  (A.R.  Newton) 

Finite  state  machines  (FSM)  can  be  specified  in  the  form  of  a  state  transition  graph  (STG). 
The  eventual  implementation  of  the  state  machine  is  in  the  form  of  a  sequential  logic  network 
consisting  of  combinational  logic  gates  and  synchronous  registers.  Given  a  finite  state  machine 
specification,  there  may  be  many  possible  implementations.  The  goal  of  synthesis  is  to  find  the 
one with the least cost.

Traditionally, this step is performed via an optimization process called state assignment.
The  goal  of  state  assignment  is  to  optimally  assign  binary  codes  to  the  internal  states  of  a  finite 
state  machine  such  that  the  resulting  synthesized  logic  is  minimized.  The  assignment  is  restricted 
such  that  no  two  states  are  assigned  the  same  binary  combination.  The  problem  of  optimal  state 
assignment  has  been  a  subject  of  extensive  research.  Numerous  techniques  have  been  developed 
to solve this problem. Most notable are contemporary methods that are tightly coupled to the
underlying logic synthesis algorithms so as to assure the optimality of the encoded results.

While  effective,  traditional  state  assignment  techniques  do  not  explore  possible  state  assign¬ 
ments  from  other  equivalent  state  machine  specifications.  Therefore,  the  solutions  obtainable  via 
standard  state  assignment  may  be  suboptimal.  That  is,  a  better  realization  may  have  been  possi¬ 
ble  from  an  equivalent  machine  with  a  different  structure.  In  general,  two  machines  can  exhibit 
the  same  overall  terminal  behavior  even  though  their  respective  structures  are  different  (non- 
isomorphic). 

Equivalent  machines  can  be  realized  by  merging  or  splitting  states.  We  refer  to  such 
transformations  as  restructuring.  State  reduction  techniques  have  been  proposed  for  modifying 
the  machine  structure  by  identifying  equivalent  states  and  systematically  eliminating  them.  The 
primary  objective  of  state  reduction  is  to  obtain  an  equivalent  machine  with  the  minimal  number 
of  states.  However,  it  is  well  known  that  a  reduced  machine  does  not  necessarily  lead  to  a  better 
logic  implementation  and  thus  offers  no  guarantee  as  to  the  quality  of  the  final  solution. 
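
State reduction by identifying equivalent states can be sketched with the classical partition-refinement procedure for completely specified machines. The code below is an illustration under assumptions, not the method used in this work: the machine is taken to be a Mealy machine given as two dictionaries keyed by (state, input), and the function name is hypothetical.

```python
def equivalent_state_classes(states, inputs, next_state, output):
    """Partition the states of a completely specified Mealy machine.

    next_state[(s, x)] and output[(s, x)] give the successor and the output
    of state s under input x. States in the same returned class are
    indistinguishable by any input sequence.
    """
    # Initial partition: states with identical output rows share a block.
    ids = {}
    block_of = {}
    for s in states:
        sig = tuple(output[(s, x)] for x in inputs)
        block_of[s] = ids.setdefault(sig, len(ids))

    while True:
        # Refine: split blocks whose members lead to different successor blocks.
        ids = {}
        refined = {}
        for s in states:
            sig = (block_of[s],) + tuple(block_of[next_state[(s, x)]] for x in inputs)
            refined[s] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block_of.values())):
            break  # no block was split, so the partition is stable
        block_of = refined

    classes = {}
    for s in states:
        classes.setdefault(block_of[s], []).append(s)
    return sorted(classes.values())
```

Merging each class into a single state yields the minimal-state machine; as the text notes, that machine is not guaranteed to give the best logic implementation.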

In  this  work,  the  relationship  between  state  machine  restructuring  and  state  assignment  is 
explored.  One  approach  is  to  combine  these  steps.  Thus,  rather  than  using  the  number  of  states 
to  guide  the  state  reduction  process,  we  hope  to  make  restructuring  decisions  implicitly  via  a 
more  general  state  encoding  problem. 


2. ARCHITECTURES AND APPLICATIONS


2.1.  VLSI  ASIC  and  System  for  Hard  Real-Time  Tracking  (R.W.  Brodersen) 

Accurate  detection  and  tracking  of  moving  objects  at  video  rates  is  an  important  problem  in 
the  vision  control  of  robots  and  in  other  machine-vision  tasks.  In  real-world  applications,  vari¬ 
able  lighting  conditions,  video  noise,  properties  of  the  objects,  and  the  background  of  the  scene 
further  complicate  this  task.  In  particular,  vision-controlled  robots  require  highly  accurate  real¬ 
time  information  about  their  work  scene. 

Standard low-level (early-vision) image-processing algorithms typically require highly controlled
lighting and noise-free images to achieve real-time or near real-time recognition of objects
in the scene. Transforming images into the Radon space allows these and other constraints on the
images to be relaxed. The Radon transform provides robust recognition of lines and line segments
in images in the presence of noise and suboptimal lighting. Recognition and tracking of
objects  then  proceeds  in  the  Radon  space. 
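
As a rough illustration of why lines stand out in the Radon space, the sketch below computes a simple discrete Radon transform by summing pixel values along lines parameterized by an angle and a signed offset. It is a software toy under assumed conventions (nearest-bin accumulation, rho = x cos(theta) + y sin(theta), a hypothetical function name), not the algorithm implemented on the board.

```python
import math


def radon_transform(image, n_angles=180):
    """Accumulate pixel values of a 2-D list image along (theta, rho) lines.

    Returns (acc, max_rho), where acc[k][rho + max_rho] is the sum of the
    pixels lying on the line x*cos(theta) + y*sin(theta) = rho for
    theta = k * pi / n_angles.
    """
    rows, cols = len(image), len(image[0])
    max_rho = int(math.ceil(math.hypot(rows - 1, cols - 1)))
    acc = [[0.0] * (2 * max_rho + 1) for _ in range(n_angles)]
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        c, s = math.cos(theta), math.sin(theta)
        for y in range(rows):
            for x in range(cols):
                value = image[y][x]
                if value:
                    rho = int(round(x * c + y * s))
                    acc[k][rho + max_rho] += value
    return acc, max_rho
```

A straight line in the image concentrates all of its pixels into a single accumulator cell, whereas isolated noise pixels spread thinly over many cells, which is why peak detection in this domain tolerates noise.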

Objects  are  trained  into  the  system  by  performing  detailed  analysis  of  their  representation  in 
the Radon space. Depending on the accuracy required for the particular application, the objects
can  then  be  recognized  by  a  variety  of  algorithms  in  the  Radon  domain.  Once  the  system  locks 
on  an  object,  tracking  the  object  involves  incremental  analyses  of  slices  through  the  Radon 
domain. 

A  new,  highly  integrated  9U  VME-based  board  has  been  built  that  implements  the  Radon 
transform in real time. Each board has four complete custom AT&T DSP32C microprocessor
cores, each of which serves as a local host. Each of these DSPs provides the necessary modularity
and programmability to support eight custom Radon-transform application-specific ICs (ASICs),
for a total of 32 ASICs on a board. Parallel and pipelined processing occurs at both the chip and
the board level. Multiple Radon-transform boards are installed in a single 21-slot VME card
cage, for a total processing capability of one-quarter trillion operations per second. The
card cage has a 68020-based single-board computer and an Ethernet communications card, which
provide support as a global host to all of the slave processors and local-area network support to
nearby workstations. The card cage runs a real-time operating system that resembles UNIX to
facilitate low-latency transfers of recognition primitives over the VME bus. Communication with
and control of the image-processing cards occur via a remote workstation through RPC and
X-windows.


The VLSI ASIC that executes the highly computation- and I/O-intensive Radon transform
algorithm has been fabricated in a 1.6-µm process through MOSIS. It operates at the required
10-MHz video rate and consists of over 100,000 transistors. Its internal architecture is flexible:
it can be programmed externally through the DSP32C to perform both forward and reverse Radon
transforms and grey-level histograms, and to act as a general-purpose RAM. Built-in line-length
counters facilitate line-length normalization. Hardware support is also provided to allow the
region of interest over which the Radon transform is taken to be determined externally. The
ASIC supports both interlaced and noninterlaced image formats.

Specific applications of the image-processing board include accurate positioning of a
four-degree-of-freedom manipulator for laser-reconfigurable integrated circuits. A video camera
attached to a microscope/probe station allows the system to recognize objects in the circuit and to
correct the position of the IC. Translation and orientation are determined at the video frame rate.
The system can be taught to recognize a number of different styles of objects; thus the
image-processing algorithms can effectively increase the accuracy of the manipulator to that of its
resolution.

Other  applications  of  the  tracking  system  include  a  six-degree-of-freedom  robot  with  cus¬ 
tom  control  hardware  that  is  integrated  with  the  real-time  image  processing  hardware. 


2.2.  Real-Time,  Flexible  Image-Processing  Printed  Circuit  Board  (R.W.  Brodersen) 

This project integrates many real-time image processing modules into a complete, compact
system.  The  system  incorporates  (1)  a  custom  VLSI  low-level  image-processing  chip  set,  as 
designed  by  Ruetz  [1],  (2)  a  custom  VLSI  histogram  processor  and  equalization  chip  set,  as 
designed  by  Richards  [2],  and  (3)  support  for  an  external  image  processor  board. 

Any  or  all  of  the  three  image  processors  can  be  used,  in  any  order,  with  commercial  A/D- 
D/A and frame-buffer boards. This system can be reconfigured with the aid of two custom VLSI
video crossbar switches, designed and fabricated in a 2.0-µm CMOS process. These two
application-specific IC designs, packaged in 132- and 68-pin packages, have been fabricated and are
fully functional at 10-MHz video rates. Four of the 132-pin version and one of the 68-pin version
provide  24-bit  (color)  and  8-bit  multiplexing  of  video  buses  as  required  by  the  chosen  image- 
processing  algorithm.  Internal  pipeline  registers  ensure  10-MHz  throughput,  and  an  external  pro¬ 
cessor  can  write  to  the  internal  configuration  registers. 

The entire image-processing system has been fabricated on a 9U VME board and is fully
functional  at  video  rates.  It  contains  16  custom  ASICs  and  2  programmable  logic  devices  (PLDs) 
for  interface  logic.  A  slave  VME  bus  interface  provides  control  of  all  the  image  processors  and 
video  multiplexer  configurations.  The  board  resides  in  a  21-slot  stand-alone  card-cage  along  with 
a  68020-based  CPU  card,  an  Ethernet  card,  and  a  custom  robot  controller  card  [3].  The  board  is 
connected  by  Ethernet  to  a  Sun  workstation  and  is  controlled  through  a  set  of  X-window  front- 
end  tools.  Window-based  software  has  been  written  to  reconfigure  the  processors  in  any  arbitrary 
fashion. 


2.3. Real Time Image Data Compression (R.W. Brodersen)

This  research  investigates  and  extends  existing  algorithms  and  architectures  for  image  data 
compression.  The  study  is  geared  towards  full-motion  video  transmission,  which  requires 
compression  from  an  initial  rate  of  62.9  Mbits/s  (for  images  defined  over  a  512  x  512  lattice, 
assuming 8 bits per pixel and 30 frames per second) to a reduced rate of 1.5 Mbits per second
(i.e., a 40:1 compression). Color images require higher compression rates. For this application,
the  compression  algorithms  are  constrained  by  the  need  for  real  time  performance  and  low  power 
dissipation. 
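
The rates quoted above follow from simple arithmetic, sketched below. The function name is hypothetical, and "Mbits" is taken to mean 10^6 bits.

```python
def raw_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed video bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


raw = raw_bit_rate(512, 512, 8, 30)   # 62,914,560 bits/s, i.e. about 62.9 Mbits/s
target = 1.5e6                        # 1.5 Mbits/s transmission rate
ratio = raw / target                  # about 41.9, quoted in round numbers as 40:1
```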

The  two  traditional  classes  of  compression  techniques  (transform  coding  and  spatial  domain 
coding) are being simulated with special emphasis on interframe hybrid transform coding. In this
coding scheme, a two-dimensional transform, typically the discrete cosine transform (DCT), is
performed on sub-blocks of the image, and the error signal from recursive estimation of previously
transmitted transform coefficients is transmitted. This method exploits the spatial redundancy
within the image to achieve high compression rates. Other techniques that improve
compression, such as motion detection and estimation, are also to be included. A major concern
of this research is to adapt appropriate image-compression algorithms for efficient VLSI
implementation. Low-power design techniques and technology factors such as scaling will play a major
part in the development of a low-power and compact compression system.
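
The interframe hybrid scheme described above can be sketched in a few lines. This is a simplified illustration, not the system under design: it assumes the prediction for each coefficient is simply the previously reconstructed coefficient of the same block, uses a uniform quantizer with a hypothetical step size, and computes the DCT directly from its definition rather than with a fast algorithm.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of lists."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def encode_block(block, predicted, step=8.0):
    """Quantize the coefficient-domain prediction error for one block.

    predicted holds the decoder's reconstruction of the same block in the
    previous frame. Returns (error, reconstruction); the reconstruction is
    what both encoder and decoder use as the prediction for the next frame.
    """
    n = len(block)
    coeffs = dct_2d(block)
    error = [[int(round((coeffs[u][v] - predicted[u][v]) / step))
              for v in range(n)] for u in range(n)]
    recon = [[predicted[u][v] + error[u][v] * step
              for v in range(n)] for u in range(n)]
    return error, recon
```

For a block that does not change between frames, the second call produces an all-zero error block, which is the case that a variable-length coder can represent very cheaply.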

A  prototype  real  time  image  compression/decompression  system  is  currently  being 
designed.  The  system  implements  a  generic  architecture  capable  of  simulating  various  image 
compression algorithms in real time, including intraframe coding, interframe coding, and motion
compensation. Special emphasis will be placed on the new CCITT standard for full-motion
image data compression at multiples of 64 kilobits per second. The board is being designed with
a commercial custom compression chip set including a DCT (and inverse DCT) processor, a
Motion Estimation Processor, and a Quantization Processor. It will also contain local video
frame buffers and signal processors to perform various miscellaneous tasks such as image data
indexing and variable-length coding. Software control will be used to reconfigure the board for
different  image  compression  algorithms. 


2.4.  Programmable  IC  Digital  Signal  Processor  with  Self-Timed  Internal  Processing  Ele¬ 
ments  (R.W.  Brodersen) 

While  digital  signal  processing  algorithms  for  many  applications  have  been  under  study  for 
decades,  it  is  only  recently  that  a  variety  of  general  purpose  digital  signal  processor  (DSP)  chips 
have become available to execute these algorithms. Many products, especially in the
telecommunications area, have begun to employ DSPs in their design. It is thought that the continued
scaling  of  the  IC  process,  which  yields  higher  density  and  faster  transistors,  should  allow  even 
more  sophisticated  algorithm  implementation.  However,  limitations  in  the  way  scaling  can  be 
performed  along  with  continued  growth  in  chip  areas  may  interfere  with  the  ability  to  take  advan¬ 
tage  of  the  fast  devices.  Namely,  the  time  taken  for  signals  to  traverse  a  chip  through  the  inter¬ 
connection  layers  is  becoming  significant  with  respect  to  the  clock  periods  of  processors.  Using 
today's synchronous "clocked" design techniques will probably not be adequate in the future
because of the difficulty in retaining global synchronization as clock rates exceed 100 MHz.

The  use  of  self-timed  circuits  (circuits  that  generate  completion  information)  to  construct 
fully  asynchronous  processors  is  one  possible  solution  to  the  synchronization  problem  mentioned 
above. This idea forces timing signals to be generated locally, bringing the synchronization
problem back down to a manageable size. The circuitry also monitors its own timing signals so that it
can absorb variations in time used for logic signal evaluation and traversal through interconnect
and still function correctly. In this way, the clock distribution problem can be alleviated and the
design of a digital signal processor will truly be scalable with technology advancements.
Also, since a clocked system inherently cannot take advantage of the differences in delay
associated with different elements of a signal processor (such as ALU, shifter, multiplier), the
self-timed approach may be able to take advantage of these differences to obtain a faster cycle time.

In an asynchronous system, each block generates a completion signal to indicate that it has
performed  its  task.  Interface  circuits  (also  known  as  handshaking  circuits)  make  use  of  these 
completion  signals  to  control  the  proper  transfer  of  data  between  stages.  The  circuit  designer 
need only be concerned with the operation of each individual block or macrocell, as is typically
done in lower-speed clocked designs where global synchronization is not an issue. Also, by
separating  the  interface  circuits  from  the  computation  circuits,  system  timing  becomes  a  matter  of 
designing  only  the  proper  collection  of  interface  circuits  that  realize  the  desired  system  operation. 

Self-timed  datapath  macrocells  were  developed,  fabricated  and  tested.  Each  of  these  imple¬ 
ments  a  datapath  function  such  as  shifter,  multiplier,  and  ALU  and  provides  completion  informa¬ 
tion.  Combining  the  datapath  cells  with  RAM  and  ROM  cells,  a  fully  asynchronous  ROM  pro¬ 
grammable  digital  signal  processor  was  designed  and  fabricated.  This  is  the  first  implementation 
known  of  a  fully  asynchronous  DSP  that  contains  feedback  and  that  is  programmable.  In  the 
DSP,  a  data  stationary  architecture  is  used,  where  the  control  signals  for  each  stage  of  the  data¬ 
path  pipeline  move  along  with  the  data  flowing  through  the  datapath.  This  fits  the  self-timed 
paradigm well since the data transfers themselves occur at times not synchronized with some
global  clock.  The  I/O  ports  on  the  DSP  are  also  self-timed  so  that  the  processor  will  automati¬ 
cally  wait  for  external  devices  to  be  ready  for  transfers. 

Three versions of the self-timed DSP were fabricated. Each was programmed to perform a
simple signal processing task. The first chip implements a 16-tap FIR lowpass filter, the second
chip implements an 8-pole IIR bandpass filter, and the third chip contains a test program to
determine that the functionality of all of the different types of instructions is correct. The instruction
cycle time depends both on the type of instruction and the data in the datapath. The basic "fast"
instruction is a shift/add operation. A hardware multiplier is available too, but since it performs
multiplication in an iterative fashion, the cycle time when it is used is longer. Other differences
in cycle time are attributed to the carry-propagation time in the ALU adder circuit, which is
data-dependent. The self-timed DSP was fabricated in a 2-µm N-well MOSIS process, and a
microphotograph is shown in Figure 1. Measured cycle times were roughly 75 nsec for the fast
instruction and 325 nsec for a multiplication. Further gains over synchronous designs are expected in a
more highly scaled technology.
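
The effect of instruction-dependent cycle times can be illustrated with a back-of-the-envelope comparison. The sketch assumes a hypothetical clocked design whose single clock period is set by the slowest instruction, which is a deliberately pessimistic baseline (a real clocked DSP would more likely spread a multiplication over several shorter cycles). The multiplication time is read here as 325 nsec, the data-dependent carry-propagation variation is ignored, and the instruction mix is invented for illustration.

```python
FAST_NS = 75.0       # measured cycle time of the fast shift/add instruction
MULTIPLY_NS = 325.0  # measured cycle time of a multiplication (read as 325 nsec)


def self_timed_ns(n_fast, n_multiply):
    """Run time when every instruction takes only as long as it needs."""
    return n_fast * FAST_NS + n_multiply * MULTIPLY_NS


def clocked_ns(n_fast, n_multiply):
    """Run time if one fixed clock period must cover the slowest instruction
    (an assumed worst-case baseline, not a measured design)."""
    return (n_fast + n_multiply) * max(FAST_NS, MULTIPLY_NS)


# Hypothetical mix: 90 shift/add instructions and 10 multiplications.
speedup = clocked_ns(90, 10) / self_timed_ns(90, 10)
```

Under these assumptions the self-timed run takes 10,000 nsec against 32,500 nsec for the single-period baseline; the advantage shrinks as the share of multiplications grows.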

The asynchronous DSP has been described in two recent publications: "A Fully Asynchronous
Digital Signal Processor using Self-Timed Circuits" was presented at the 1990 IEEE International
Solid-State Circuits Conference in San Francisco, CA. "Self-Timed Integrated Circuits
for Digital Signal Processing" was published as a Memo of the Electronics Research Laboratory
at UC Berkeley (Memorandum No. UCB/ERL M89/128). Another journal article describing the
processor is currently being prepared and will be submitted to the IEEE Journal of Solid-State
Circuits. 


2.5. TADS: A Test Application and Development System (R.W. Brodersen)

As the complexity of integrated circuits increases and surface-mount interconnection
technology evolves, significant new problems arise in testing of board-level systems. The IEEE Joint
Test Action Group has proposed a design-for-testability standard which alleviates these problems.
They have proposed a boundary-scan architecture comprising a boundary-scan register
(BSR)  and  a  test  access  port  (TAP)  that  allows  access  both  to  the  BSR  and  to  other  embedded  test 
circuitry.  These  BSRs  allow  control  and  observation  of  any  node  in  the  system,  improving  its 
testability. 
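
The shifting behavior of a boundary-scan register can be sketched as follows. This toy model is an illustration under assumptions: it models only the capture, shift, and update actions of a chain of cells and omits the TAP controller state machine and the instruction register; the class and method names are hypothetical.

```python
class BoundaryScanRegister:
    """Toy model of a chain of boundary-scan cells."""

    def __init__(self, length):
        self.cells = [0] * length    # shift stage of each cell
        self.latched = [0] * length  # update stage that would drive the pins

    def capture(self, pin_values):
        """Load the current pin values into the shift stages in parallel."""
        self.cells = list(pin_values)

    def shift(self, tdi_bits):
        """Shift bits in at the TDI end and return the bits seen at TDO."""
        tdo_bits = []
        for bit in tdi_bits:
            tdo_bits.append(self.cells[-1])       # last cell feeds TDO
            self.cells = [bit] + self.cells[:-1]  # every cell moves one place
        return tdo_bits

    def update(self):
        """Transfer the shifted-in pattern to the update stages."""
        self.latched = list(self.cells)
```

Capturing pin values and then shifting them out gives observation of the pins; shifting a pattern in and updating gives control, all through the serial test port.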

Incorporating this design-for-testability standard into our board-level designs requires a
system that does it automatically. The system should also provide software that supervises and
administers  a  test  and  configures  the  testing  hardware.  TADS,  which  is  a  TEST  APPLICATION 
and  DEVELOPMENT  SYSTEM,  is  currently  being  developed  to  accomplish  these  tasks. 

The  current  research  has  two  objectives: 

(1)  The  first  objective  is  to  design  a  test  controller  module  and  a  library  of  test  support  cells. 
The  test  controller  module  interfaces  with  other  modules  via  TAP  bus  and  executes  the 
tests. 

(2)  The  second  objective  is  to  develop  software  that  automatically  generates  test  patterns, 
automatically  synthesizes  the  test  hardware,  and  analyzes  the  test  results. 


2.6.  Design  Tools  for  Rapid  Design  of  Oversampling  A/D  Converters  (R.W.  Brodersen) 

The  goal  of  this  project  is  to  provide  tools  that  will  aid  designers  in  exploring  tradeoffs  in 
the design of oversampling A/D converters. Current methods of design rely on extensive time-domain
simulations and expertise in the areas of switched-capacitor analog circuit design and digital filter
design. 

An  evaluation  was  made  of  the  current  Berkeley  tools  for  the  synthesis  of  basic  analog 
building  blocks.  These  tools  are  lacking  in  generality  and  are  not  well  tested.  Since  oversam¬ 
pling  A/D  converters  require  small  amounts  of  analog  circuitry  that  can  be  reused  in  many 
designs,  it  makes  more  sense  to  design  leafcells  which  can  be  characterized  through  testing  rather 
than trying to automate the analog design process for each new design. To this end, a second-order
delta-sigma modulator has been fabricated and will be tested and characterized in the
coming months.
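
For reference, the behavior of a second-order delta-sigma modulator can be simulated in a few lines. The sketch below does not model the fabricated circuit: it assumes an idealized loop with two delaying integrators, gains of 0.5, and a 1-bit quantizer, which is one common textbook arrangement, and the function name is hypothetical.

```python
def delta_sigma_2nd_order(samples):
    """Return the +1/-1 output stream of an idealized second-order modulator.

    samples should lie well inside the range -1..+1.
    """
    u1 = 0.0  # first integrator state
    u2 = 0.0  # second integrator state
    bits = []
    for x in samples:
        y = 1.0 if u2 >= 0.0 else -1.0   # 1-bit quantizer
        bits.append(int(y))
        # Delaying integrators: both updates use the states from this step.
        u2 += 0.5 * (u1 - y)
        u1 += 0.5 * (x - y)
    return bits
```

For a constant input, the average of the output stream approaches the input value, which is the property the downstream digital filter exploits.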


Currently,  a  collection  of  C  language  programs  exists  for  simulating  the  modulators,  but  the 
previous hardware platform used for circuit testing and data acquisition has become outdated. A
new set of testing boards is being designed. The goal is to provide a means of collecting data
from actual circuits and uploading the data through a VME interface. The simulation and analysis
programs  will  then  be  used  to  characterize  circuit  behavior. 

Oversampling A/D converters require digital filters for eliminating out-of-band noise prior
to resampling at a lower rate. These filters tend to be costly in area. An area-efficient architecture
has been identified for the first digital filter, which processes the 1-bit output of the modulator.
Efforts are being made to write a program that can map a finite impulse response onto this
architecture automatically, providing sdl files as output.


2.7.  Active-Word  Processor  for  a  Real-Time,  Large-Vocabulary,  Continuous  Speech 
Recognition  System  (R.W.  Brodersen) 

The  active-word  processor  controls  the  active-word  and  active-list  memories.  Data  comes 
from  two  sources,  the  Viterbi  processor  and  the  grammar  subsystem.  The  former  requests  that  a 
currently-active  word  be  added  to  the  active-word  memory  if  this  word  is  likely  to  still  be  active 
during  the  next  frame;  the  latter  requests  additions  to  the  active-word  memory  if  a  highly  prob¬ 
able  word,  having  just  ended,  generates  one  or  more  successors.  These  requests  may  come  at  any 
time  so  the  active-word  processor  has  to  arbitrate  between  requests  in  order  to  access  each 
memory at most once per clock cycle. The processor must therefore stall the Viterbi processor
and/or  refuse  data  from  the  grammar  subsystem  when  necessary. 

The  processor  uses  data  stations  to  store  requests.  When  a  request  arrives,  it  is  loaded  into 
the first available station. Once in the station, it is processed automatically. Processing requires 6
clock cycles, after which the station is available for the next request. Each request requires two
accesses to the active-word memory and two accesses to the active-list memory. Therefore, 3
data  stations  are  required.  If  a  newly-loaded  request  conflicts  with  another  request  that  is  being 
processed  in  another  station,  the  new  request  is  ignored  until  the  other  request  is  completed. 
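
One way to read the station count is as a throughput argument; this reading is an assumption, since the text does not spell it out. If each memory accepts one access per clock cycle and a request needs two accesses to each memory, a new request can start at most every two cycles, so a six-cycle processing time keeps three requests in flight. The function name below is hypothetical.

```python
import math


def stations_needed(latency_cycles, accesses_per_memory):
    """Number of data stations that keeps the memories fully busy.

    A memory accepts one access per clock cycle, so requests can start no
    more often than once every accesses_per_memory cycles; each request
    then occupies its station for latency_cycles cycles.
    """
    return math.ceil(latency_cycles / accesses_per_memory)


stations = stations_needed(latency_cycles=6, accesses_per_memory=2)
```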

There are several datapaths in the processor. They are the grammarnode probability datapath
(16 bits), wordarc datapath (20 bits), tag datapath (20 bits), topology datapath (16 bits), state
probability datapath (18 bits), phoneme datapath (5 bits) and flag datapath (2 bits). There is also a
counter  and  a  control  unit  containing  a  PLA-based  finite  state  machine. 

The active-word processor is physically implemented in two custom chips. This is necessary
because the pin count of the processor is too high to fit into one package. The larger chip contains
the state machine that controls the operation of the processor and the datapaths for the flags,
wordarc and grammarnode probability. The smaller chip is a slave to the larger chip and contains
all the other datapaths.

Some  additional  circuitry  is  included  in  the  chips  to  facilitate  testing.  To  aid  in  chip  testing, 
all  registers  are  scanpath  registers.  For  board  testing,  the  chips  interface  between  the  VME  bus 
and the two memories so that these memories may be written and read by the VME host.

The chips are being designed using the LagerIV silicon assembly system and will be simulated
using IRSIM.


2.8. Interface Board for the Robot Control System (R.W. Brodersen)

This project involves the design of an interface board for the robot system. The board interfaces
the various relays, motor current sensors and H-bridges driving the robot motors with the
VME board running the control algorithms. The main motivations behind the design of this board
are  to  isolate  the  high-speed  digital  electronics  on  the  VME  board  from  the  noisy  high-power 
electro-mechanical  devices  in  the  robot  and  to  improve  the  mechanical  characteristics  of  the  sys¬ 
tem.  This  is  being  done  to  alleviate  the  electro-magnetic  interference  and  ground-bounce  prob¬ 
lems  experienced  in  our  current  implementation  where  the  interface  electronics  is  on  the  VME 
board  itself.  A  second  motivation  for  the  board  is  to  provide  current  feedback  to  the  H-bridges 
allowing  control  of  the  motor  torques,  as  opposed  to  the  voltages,  which  is  desirable  for  most 
sophisticated  robot  control  algorithms. 

The board consists of analog filters, comparators and pulse-width modulation circuits to
provide current-feedback-based control of the H-bridges for each of the six joints of the robot
arm. In addition, six analog-to-digital converters and some digital circuits allow digital generation
of the pulse-width modulated signals on the VME board itself and torque control through
digital current feedback.

Optical isolation, shielding and multiple ground and power planes are being used to isolate
the VME board from the robot and to reduce ground bounce and EMI on the interface board
itself. The schematic design is complete and the board placement and routing is expected to be done
in a few weeks.

In parallel with the current design, we are exploring the use of time-multiplexed
optical-fiber-based communication between the VME board and the interface board. This would result in
much greater EMI resistance, mechanical robustness and isolation. We are currently looking
into possible techniques and their implications for the system architecture and partitioning.


2.9.  Design  of  Real-Time  Systems  with  Application  to  Robotics  (R.W.  Brodersen) 

The goal of this project is to develop a CAD framework for the design of dedicated real-time
systems starting from a high-level structural description. In particular, we are interested in
real-time feedback systems that interact with the outside world through a variety of asynchronous
events. A real-time multi-sensory robot control system is being used as the driver application.

To study first hand the issues involved in such systems, a manual design of a first-generation
robot control system was completed last year. A DSP32C-based custom VME slave
board forms the heart of the system and is fully operational. Our experience indicated that simulation
and the integration (communication and synchronization) of the various software and
hardware modules residing on different custom boards or general-purpose CPUs form the most
difficult part of the design phase.

Making use of our experience, we are currently exploring a suitable CAD framework for the
design of such systems. Our experience suggests that a good way of describing such complicated
systems is to view them as a static network of concurrent processes interacting with each other
through events and messages. The processes can be implemented as software running on
general-purpose CPUs or as dedicated VLSI chips. More than one process may be mapped onto a single
general-purpose CPU or custom VLSI chip. Such a description can be viewed as a type of block diagram.
Currently we are looking into representation in OCT and simulation of systems described in such
a fashion.

Next we plan to work on mapping such a description into a library of hardware and software
modules and automatically generating the "glue" software and hardware to handle the communication
and synchronization. Our approach is based on defining a generic architecture for multi-board
systems, which consists of multiple slave boards on a VME backplane that have standardized
hardware and software interfaces to a VME master CPU. The slave boards are custom
boards implementing complete subsystems or are off-the-shelf CPU cards. Each custom board is
in turn based on a core architecture consisting of a board-controller CPU and multiple slave
modules. The slave modules come from an application-domain-specific library and can be
software for a general-purpose DSP or microprocessor, or a behavioral or structural description of
an ASIC.

We intend to use the CAD framework to describe, simulate and design our next-generation
robot-control board, which will have more compute power than the current board and the ability to
handle force sensor data.


2.10. Backtrace Memory Processor for a Real-Time, Large-Vocabulary, Continuous-Speech
Recognition System (R.W. Brodersen)

A crucial part of the proposed real-time continuous-speech recognition system [9] is the
backtracing algorithm to recover the state sequence through the grammar model after a whole
sentence has been spoken. Since this sequence is not known until the last frame of a sentence has
been processed, all the possible word sequences have to be stored in a memory (backtrace
memory). Each entry in this memory contains the identification of a particular grammarnode
along with an address that points to the predecessor of this grammarnode. Since there is no
end-of-word detection, each grammarnode needs an entry in this memory for every timeframe. Thus,
the stored predecessor of a grammarnode is in most cases the grammarnode itself.
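
As an illustration only, the backtrace can be sketched in Python as follows. The entry layout
(a grammarnode name plus a predecessor address), the word labels and the collapsing of repeated
nodes are assumptions made for this sketch; they are not taken from the hardware design.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    node: str             # identification of the grammarnode
    pred: Optional[int]   # address of the predecessor entry (None at sentence start)


def backtrace(memory: List[Entry], last: int) -> List[str]:
    """Follow predecessor addresses from the final entry back to the sentence start."""
    path: List[str] = []
    addr: Optional[int] = last
    while addr is not None:
        entry = memory[addr]
        # Most predecessors are the grammarnode itself (one entry per frame),
        # so repeated nodes are collapsed. This simplification would also merge
        # a word that is genuinely spoken twice in a row.
        if not path or path[-1] != entry.node:
            path.append(entry.node)
        addr = entry.pred
    path.reverse()
    return path


# Hypothetical backtrace memory contents; list index = memory address.
memory = [
    Entry("the", None),  # frame 0
    Entry("the", 0),     # frame 1: predecessor is the node itself
    Entry("cat", 1),     # frame 2
    Entry("dog", 1),     # frame 2: competing hypothesis
    Entry("cat", 2),     # frame 3
]
print(backtrace(memory, last=4))
```

Only the entry reached in the last frame is needed to start the trace, which is why every
candidate sequence has to remain in memory until the sentence ends.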

To avoid storing too much backtrace information, the backtrace memory processor only
stores a grammarnode in the backtrace memory if the probability associated with this grammarnode
is higher than a threshold probability. The processor also computes this pruning threshold
based on a running maximum probability. The running maximum can be preset at the start of
each frame to avoid initialization effects.
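
A minimal sketch of this pruning rule is given below. It assumes log-probabilities and a
threshold that lies a fixed beam width below the running maximum; the report does not state how
the threshold is derived from the maximum, so the `beam` parameter and the class name are
hypothetical.

```python
class Pruner:
    """Keep a grammarnode only if it scores above (running maximum - beam)."""

    def __init__(self, beam: float) -> None:
        self.beam = beam                      # assumed beam width in log-probability units
        self.running_max = float("-inf")

    def start_frame(self, preset: float = float("-inf")) -> None:
        # Presetting the maximum avoids the initialization effect in which the
        # first nodes of a frame pass only because no maximum is known yet.
        self.running_max = preset

    def offer(self, log_prob: float) -> bool:
        """Update the running maximum and report whether the node should be stored."""
        self.running_max = max(self.running_max, log_prob)
        return log_prob > self.running_max - self.beam


pruner = Pruner(beam=5.0)
pruner.start_frame()
print(pruner.offer(-20.0))   # stored: nothing better has been seen in this frame
pruner.start_frame(preset=-10.0)
print(pruner.offer(-20.0))   # pruned: the preset maximum already excludes it
```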

The processor is a pure slave processor to the Viterbi processor. It is not driven by the system
clock; a strobe generated by the Viterbi processor starts a pruning and threshold-updating
operation after a word has been processed.

Because of this asynchronous behavior and the low computational rate (once for each
grammarnode), a standard-cell design methodology was used. The backtrace memory processor,
which was fabricated in 2-µm CMOS technology, has a die size of 3044 µm × 3574 µm, 7092
transistors, and 94 signal pins. It is packaged in a 108-pin pin-grid array. All layout was generated
using the LagerIV silicon assembly system.


2.11. ASIC’s for Numerical Computations (R.W. Brodersen)

The traditional problem domain for digital signal processors (DSPs) has been operations
such as filtering, equalization, and spectrum analysis. This project attempts to extend the boundaries
of DSP applications to include more numerical computation problems.

As a development vehicle we are considering the problem of finding all solutions of a system
of polynomial equations in n variables. This problem is irregular and complex compared to
traditional digital signal processing, but also computationally intensive.

Our current work focuses on continuation methods [4]. The main problem is the need to
solve large numbers of n-by-n systems of linear equations. Whether it will be feasible to perform these
computations in fixed-point arithmetic is not yet clear. For efficiency, it may also be necessary to
develop a dedicated architecture for the problem.
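
To make the idea concrete, the following Python sketch applies a continuation (homotopy) method
to a single polynomial in one variable, using floating-point complex arithmetic. The start system
x^d - 1, the constant `gamma`, the step count and the function names are all assumptions of this
sketch rather than details of the project. In n variables the scalar division in the Newton
correction becomes the n-by-n linear solve mentioned above.

```python
import cmath


def poly_eval(coeffs, x):
    """Evaluate a polynomial (highest-degree coefficient first) by Horner's rule."""
    result = 0j
    for c in coeffs:
        result = result * x + c
    return result


def poly_deriv(coeffs):
    """Coefficients of the derivative polynomial."""
    degree = len(coeffs) - 1
    return [c * (degree - i) for i, c in enumerate(coeffs[:-1])]


def continuation_roots(coeffs, steps=200, newton_iters=5, gamma=complex(0.6, 0.8)):
    """Track the roots of x**d - 1 to the roots of the target polynomial.

    Homotopy: H(x, t) = (1 - t) * gamma * (x**d - 1) + t * f(x), for t from 0 to 1.
    """
    degree = len(coeffs) - 1
    dcoeffs = poly_deriv(coeffs)
    starts = [cmath.exp(2j * cmath.pi * k / degree) for k in range(degree)]
    tracked = []
    for x in starts:
        for step in range(1, steps + 1):
            t = step / steps
            # Newton correction at the new t; for n variables this division
            # is replaced by solving an n-by-n linear system.
            for _ in range(newton_iters):
                h = (1 - t) * gamma * (x ** degree - 1) + t * poly_eval(coeffs, x)
                dh = ((1 - t) * gamma * degree * x ** (degree - 1)
                      + t * poly_eval(dcoeffs, x))
                x = x - h / dh
        tracked.append(x)
    return tracked


# x**2 - 3x + 2 has the roots 1 and 2.
for root in continuation_roots([1, -3, 2]):
    print(round(root.real, 6), round(root.imag, 6))
```

The complex constant `gamma` keeps the paths away from singular points for generic inputs; a
production path tracker would also adapt the step size, which this sketch does not.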


2.12.  ASIC’s  for  Inverse  Kinematics  (R.W.  Brodersen) 

We are using the LagerIV silicon assembly system to design an application-specific
integrated circuit (ASIC) for high-speed inverse kinematic computation for elbow-type robots
such as the Puma 560 and the Panasonic NM-6740.

The  goal  is  to  produce  a  circuit  that  provides  solutions  at  the  sample  rate  (which  may  be  up 
to  1  or  2  kHz)  of  the  robot  control  loop.  Such  a  circuit  would  make  it  possible  to  plan  detailed 
trajectories  in  Cartesian  space  and  feed  the  corresponding  angle  data  directly  into  the  control 
loop. 

The algorithm is more complex and less structured than those commonly used in digital signal
processing. In attempting to design an efficient and accurate fixed-point version of the algorithm,
we have had to develop efficient implementations of elementary functions (atan2, sin, cos,
and sqrt) for the LagerIV processor architecture. We have also identified and designed improvements
to the processor architecture for the inverse kinematics task.
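
As one example of what a fixed-point elementary function can look like, the sketch below computes
a square root using only shifts, additions and comparisons. The digit-by-digit method and the
12-bit fractional format are assumptions chosen for illustration; the report does not say which
algorithms or word formats were used on the LagerIV processor.

```python
FRAC_BITS = 12  # assumed fixed-point format: 12 fractional bits


def isqrt_bitwise(n: int) -> int:
    """Floor of the square root of a non-negative integer (digit-by-digit method)."""
    if n < 0:
        raise ValueError("square root of a negative number")
    result = 0
    bit = 1
    while (bit << 2) <= n:      # find the highest power of four not exceeding n
        bit <<= 2
    while bit:
        if n >= result + bit:
            n -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result


def fixed_sqrt(x_fixed: int) -> int:
    """Square root of a fixed-point value; input and output share FRAC_BITS."""
    # sqrt(v) * 2**F == sqrt(v * 2**(2F)) == sqrt(x_fixed * 2**F)
    return isqrt_bitwise(x_fixed << FRAC_BITS)


two = 2 << FRAC_BITS
print(fixed_sqrt(two) / (1 << FRAC_BITS))   # close to 1.41421
```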

This project has been the first to use the RL high-level programming language to program a
LagerIV processor. The RL compiler [5] translates RL directly into symbolic microcode. The
retargeting ability of the compiler, which allows experimentation with several variants of the target
architecture, has proved extremely useful for evaluating enhancements to the architecture.

New developments: The algorithmic problems have been solved, and our new facility for
automatic assembly of THOR simulation models has been used to run complete simulations of
the RL program. The results have been verified and compared to higher-level simulations. The
latest design improvements (see below) have been incorporated, and the final chip core has been
successfully simulated with both THOR and IRSIM. We are currently working on fitting the core
with pads. After a final from-the-pads simulation, the chip will be sent to fabrication.


2.13. A Programmable DSP for LagerIV (R.W. Brodersen)


The Kappa processor architecture was developed by Khalid Azim [6] for the LagerIII silicon
assembler. We have ported the design to LagerIV and enhanced the architecture by including
a newly designed logarithmic shifter and an array multiplier.

It  is  possible  to  generate  different  versions  of  the  processor  by  selecting  parameter  values 
such  as  wordlengths,  memory  size,  and  microcode. 

This process is automatic, driven by parameter values. It is also possible to customize such
characteristics as the number of registers, multiplier type (array or shift/add), shifter type (variable
shift from register vs. fixed shift from instruction), and interconnections (buses). These
changes require some manual work by the user.

Kappa  can  be  programmed  in  the  RL  high-level  programming  language.  [7]  special-purpose 
architecture  of  Kappa.  We  have  interfaced  the  standard  version  of  Kappa  to  the  RL  microcode 
compiler  [13]  by  providing  a  new  Microcode  assembler  (Mass). 

Our effort to automatically assemble THOR simulation models of any given instance of the
Kappa processor is now successfully completed. This has made it possible to simulate the execution
of complex RL programs directly on a logic-level model of the processor, and at a reasonable
speed (10 processor cycles per CPU second). This is more than an order of magnitude faster than
layout extraction-based IRSIM simulations.

The access to speedy THOR simulations has prompted a more thorough testing of the architecture,
and some fatal bugs in the original processor design have been corrected.

The  internal  control  of  Kappa  has  been  redesigned  to  achieve  higher  layout  density.  This 
involved  changes  in  the  controller  architecture,  as  well  as  design  of  new  leafcells.  The  control 
unit  is  now  35%  smaller  than  before  and  close  to  optimal  for  our  largest  test  example. 

The latest revision of the design has been brought through the entire process from THOR
simulation to layout, extraction and IRSIM simulation. The functionality of the new design has been
verified.


2.14.  Board-Level  System  Interfacing  (R.W.  Brodersen) 

At the board level, where chips are the building blocks and are interfaced with extra logic,
the development of board design techniques lags behind the development for VLSI chip design.
To reduce board design complexity and to ensure functionality and required performance, this
research aims to automate the design of interface logic and implementation. The chip components
may use asynchronous or synchronous communication protocols like handshaking to
transfer binary data between them. Given the protocols and the system architecture, i.e. the
communication topology of who talks to whom, the goal of the interface tool is then to generate circuits
that properly control protocol event sequencing and ensure that timing requirements are met during
communications. This interface synthesis system provides high-level user input specification,
a library database of components, and the actual synthesis tools.

The input from the user is a block diagram of the system architecture. The block diagram is
made up of blocks which represent components or modules of components, and interconnection
nodes and directed edges which represent interfaces and the logical communication channel that
connects the module ports, respectively. The interface synthesis program can look up the necessary
information about each component’s I/O timing behavior, as defined by its protocol, which resides in a
component library. The block diagram is represented as a netlist in the .sdl format.

The underlying library consists of components and information about their I/O timing
behavior. I/O timing behavior normally contains many details about the sequencing of I/O events
and timing relationships between events. These details are tedious for the user to enter for each
design, and the library is provided as a convenience to the user and to maintain integrity of the
timing behavior information. The timing behavior is represented as a signal-transition graph [1]
which has event nodes and precedence edges. Timing constraints between events are attached as
weights to the graph edges. Currently, this research is exploring representation with the
operation/event-graph model [8], which allows complex control over event sequencing. The
graphs are stored in the database as a netlist of nodes and edges represented in a textual format,
which is currently being developed.
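
A small Python sketch of such a graph is shown below. It assumes that each edge weight is a
minimum delay in nanoseconds and that a single pass of a handshake has been unrolled into an
acyclic graph; the event names, the delays and the `earliest_times` helper are hypothetical and
do not reflect the textual format under development.

```python
from graphlib import TopologicalSorter

# Precedence edges: (earlier event, later event, assumed minimum delay in ns).
EDGES = [
    ("req+", "ack+", 10),
    ("ack+", "req-", 5),
    ("req-", "ack-", 10),
    ("req+", "data_valid", 3),
    ("data_valid", "ack+", 12),
]


def earliest_times(edges):
    """Earliest time of each event that satisfies every minimum-delay constraint."""
    preds = {}
    for src, dst, delay in edges:
        preds.setdefault(src, [])
        preds.setdefault(dst, []).append((src, delay))
    # TopologicalSorter takes a mapping from each node to its predecessors.
    order = TopologicalSorter(
        {node: [p for p, _ in plist] for node, plist in preds.items()}
    ).static_order()
    times = {}
    for node in order:
        # An event may occur only after all its predecessors plus their delays.
        times[node] = max((times[p] + d for p, d in preds[node]), default=0)
    return times


print(earliest_times(EDGES))
```

Here the data path (3 ns + 12 ns) rather than the direct request-to-acknowledge constraint
(10 ns) determines when `ack+` may occur, which is the kind of interaction that is tedious to
track by hand.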

The interface synthesis tool reads from the library the timing behavior of the appropriate
components. Then, given the communication topology from the block diagram, it can generate the
interface logic. The first step is merging the event-graphs of individual component ports that
communicate into one event-graph that represents the I/O timing behavior of the interface. The
algorithm that merges the graphs will consider different ways of merging the graphs to get different
performance and cost (of the interface circuit) results. In the next step, the synthesis
software will include graph check programs which detect consistency errors or enforce a particular
graph property to hold prior to actual synthesis. The goal here is to guarantee correctness of
the logic implementation from the graph specification [1]. Thereupon, a functionally correct
merged graph specification can be transformed into a state graph from which logic synthesis can
be performed using developed techniques [9].


2.15. Wideband Digital Portable Communications (R.W. Brodersen)

The purpose of this research is to investigate and develop techniques for digital wireless
communications capable of handling extremely high data rates, yet in a small, portable device.
Clearly, the desire for portability presents the challenge of low power consumption within the
system, while the desired data rates (10 Mbit/sec) present the conflicting goals of wide
bandwidths and high data-processing rates. We envision such a device to serve as a terminal of a
"micro-cellular" system, similar in spirit to today’s cellular phone network, but with a fully digital
implementation, capable of handling information transfers well beyond simple voice transmission,
including full-motion video. The microcell concept allows for these bandwidth requirements,
allowing high frequency reuse factors and correspondingly high spectrum efficiency, as
well as keeping short transmission distances between mobile and base units.

We are now beginning to examine channel and data encoding, modulation, and detection
schemes which are optimal within these constraints, through extensive use of computer-aided
modeling. Furthermore, channel noise, low-antenna multipath propagation, and fading issues in a
microcell transmission environment must be considered. Currently, any such modeling must be
done using field-measured fading/distortion data; we are developing a computer-aided simulation
program which takes "physical descriptions" of the terrain (e.g., placement of buildings, trees, etc.)
and provides reasonably accurate data concerning the transmission channel.

As a first iteration in developing such a communications system, we are constructing a 10
Mbit/sec wireless data link, which would be compatible with current Ethernet standards. The
most promising candidate thus far is a time-division multiple access system, using 16- or
32-quadrature amplitude modulation with Viterbi error coding.
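
For orientation, the sketch below maps bits onto a 16-QAM constellation and computes the symbol
rate implied by a given bit rate. The Gray-coded level assignment is a common textbook choice and
an assumption here, and the rate calculation ignores coding overhead and multiple-access framing,
neither of which is specified in this report.

```python
# Gray-coded mapping of two bits to one of four amplitude levels (assumed).
GRAY_LEVEL = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}


def qam16_modulate(bits):
    """Map groups of four bits to 16-QAM symbols (in-phase + j * quadrature)."""
    if len(bits) % 4:
        raise ValueError("bit count must be a multiple of 4")
    symbols = []
    for i in range(0, len(bits), 4):
        in_phase = GRAY_LEVEL[(bits[i], bits[i + 1])]
        quadrature = GRAY_LEVEL[(bits[i + 2], bits[i + 3])]
        symbols.append(complex(in_phase, quadrature))
    return symbols


def symbol_rate(bit_rate, bits_per_symbol):
    """Symbols per second needed for a raw bit rate, ignoring all overhead."""
    return bit_rate / bits_per_symbol


print(qam16_modulate([0, 0, 1, 0, 1, 1, 0, 1]))
print(symbol_rate(10e6, 4))   # 16-QAM carries 4 bits per symbol
print(symbol_rate(10e6, 5))   # 32-QAM carries 5 bits per symbol
```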


2.16. Oct2PCB Placement and Routing within LagerIV (R.W. Brodersen)

Until recently, the LagerIV assembly system addressed only the design of custom integrated
circuits. However, most real-time systems, including those we are developing, incorporate commodity
and programmable components, as well as custom devices on the same board. We have,
therefore, extended the toolset of the LagerIV system to handle the design of printed circuit
boards (PCBs). We have developed interfaces between the Oct database environment and a set of
commercial PCB tools, such as those offered by Hewlett-Packard and Racal-REDAC, which
allow us to describe the composition of the board using the sdl structural language exactly as we
would describe the composition of a custom integrated circuit. This description, which might
include some constraints on the board topology, is then passed to the commercial tools for placement
and routing.

Current efforts focus on enhancing the modularity and features of the PCB description and
adding tools for board extraction, modeling (including connectors and backplanes), and electrical
simulation.


2.17.  Techniques  for  Very  Fast  System  Prototyping  (J.M.  Rabaey,  R.W.  Brodersen) 

Prototyping a system is often the only way to test its correctness and optimize its behavior
with respect to the initial specifications. System prototyping has traditionally been an extremely
tedious operation, using commodity components such as microprocessors and memories combined
with a lot of glue logic implemented in TTL.

Recent  advances  in  technology  have  yielded  several  components  and  techniques  that  could 
reduce  the  prototyping  time  dramatically.  One  is  the  sea-of-gates  approach,  which  allows  the 
integration  of  systems  of  considerable  complexity  in  a  moderate  turnaround  time.  Another 
approach  involves  the  use  of  Programmable  Logic  Devices  (PLDs),  whose  complexity  has 
recently  risen  to  9000  gates.  This  approach  offers  great  flexibility  and  instantaneous  turnaround. 

In this project, we have analyzed the use of PLDs for prototyping signal-processing applications
such as speech recognition and image processing. The results of those experiments have
allowed us to classify existing PLD devices according to their complexity, flexibility, and
architectural features. This information will help us to develop tools for automatic partitioning
and architecture selection.

We have implemented an interface between the Lager/Oct environment and commercially
available PLD mapping tools. We have also developed a method for mapping logic equations
into PLD primitives using MIS-II and Espresso. To map large designs efficiently into a set of
PLDs, we have developed two partitioning methods. One method uses a fast clustering algorithm.
Another method uses a linear programming approach to find an optimal solution. With
this interface, it will be possible to generate PLDs starting from our standard specification
environment (sdl, VEM) and to compare different implementation techniques (such as custom-cell,
gate-array, or PLD) relatively quickly. We are currently targeting PLDs manufactured by
Actel, Altera, and Xilinx.

We have used this interface to map designs used on a 32-megabyte memory board of our
HMM speech recognition project. These designs include a memory controller, VME interface
logic, address generation logic, and a saturating adder.
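
For readers unfamiliar with the last item, a saturating adder clamps its result to the
representable range instead of wrapping around on overflow. The Python sketch below shows the
behavior for two's-complement operands; the 16-bit default width is an assumption, since the
report does not give the word length used on the memory board.

```python
def saturating_add(a: int, b: int, bits: int = 16) -> int:
    """Add two signed integers and clamp the sum to the `bits`-wide signed range."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, a + b))


print(saturating_add(30000, 10000))    # clamps at the largest 16-bit value
print(saturating_add(-30000, -10000))  # clamps at the smallest 16-bit value
print(saturating_add(100, -50))        # in range, so the plain sum is returned
```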


2.18.  Fast  Prototyping  of  Video  and  Speech  Systems  (J.M.  Rabaey) 

The  implementation  and  realization  of  video  and  speech  systems  is  always  cumbersome, 
requiring  extensive  time  and  resources.  The  high  performance  constraints  force  the  designer  to 
use  either  a  bulky  TTL  board  or  custom-designed  chips,  both  of  which  are  expensive. 

This project attempts to speed up prototyping by defining a library of macrocomponents at a
high enough level to simplify the board design but still flexible enough for a wide variety of
high-performance signal-processing applications. To define the contents of the library and the
desired programmability and functionality, we are currently examining a set of image- and
video-processing algorithms as well as some problems in speech recognition. These include
recursive filters, matrix conversion from RGB to luminance and chrominance for video, linear
and non-linear image filtering using linear convolution and sorting algorithms respectively,
dynamic time warping and Hidden Markov based algorithms for speech recognition, and others.
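
The RGB conversion is representative of the small, regular matrix operations in this set. The
sketch below uses the full-range ITU-R BT.601 coefficients, which is an assumption: the report
does not name the conversion standard under study.

```python
# Full-range ITU-R BT.601 coefficients (assumed; the report names no standard).
RGB_TO_YCBCR = (
    (0.299, 0.587, 0.114),          # luminance Y
    (-0.168736, -0.331264, 0.5),    # chrominance Cb
    (0.5, -0.418688, -0.081312),    # chrominance Cr
)


def rgb_to_ycbcr(r, g, b):
    """Multiply one RGB pixel by the 3x3 conversion matrix."""
    return tuple(cr * r + cg * g + cb * b for cr, cg, cb in RGB_TO_YCBCR)


y, cb, cr = rgb_to_ycbcr(255, 0, 0)   # pure red
print(round(y, 3), round(cb, 3), round(cr, 3))
```

Each output sample costs three multiplications and two additions per pixel, which is why this
operation maps well onto a short pipelined datapath.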

We have observed that most of these systems are implemented as a set of concurrently
operating, bitsliced and pipelined processors. The connection and communication patterns, the
controller structure, the datapath composition, and the memory organization of the processors,
however, depend heavily upon the application. For each entity, we are trying to define a restricted
set of programmable components that covers most of the architectural alternatives. Four
classes of devices are necessary: datapath blocks, controllers, memory (including delay lines),
and interprocessor communication units.

Fairly efficient solutions are available for control structures (using programmable logic devices)
and memory structures. However, no high-level re-programmable datapath or interprocessor
communication structures are yet available. We are trying to find ways to realize high-level
datapath and communication components using both laser and software-programmable on-chip
interconnect and multiple buffering register files.

Several key issues have been identified for the datapath blocks: granularity of the processing
elements (PEs), choice of the operator(s) for the PEs, interconnectivity between PEs,
communication of control flags between PEs and the controller, datapath widths, and I/O
bandwidth. The prototype, dubbed PADDI for Programmable Arithmetic Devices for Digital
Signal Processing, has been designed to cover a range (1-10) of sampling-interval-to-clock ratios
to handle lowly multiplexed to highly multiplexed datapaths. PADDI contains 8-bit linkable
execution units, a statically programmable hierarchical interconnect network, and a hierarchical
control mechanism. The initial design envisions a central external controller chip that implements
the control flow of the algorithms and broadcasts the state to different execution units, which may
reside on different chips. Fast programmable state sequencers are commercially available for this
purpose. A local nanostore determines the operations to be performed on a particular datapath.
Both the nanostore and the configuration memory for the interconnection network can be loaded
serially upon initialization. We anticipate that the nanostore will also be able to provide a
next-state address for branching. We are currently in the process of defining the micro-architecture and
VLSI floor-planning for PADDI.
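
One plausible reading of "linkable" is that neighboring 8-bit units can be chained through their
carry signals to form wider operators. The Python sketch below models that reading for addition
only; the function names and the carry-chaining interpretation are assumptions, as the PADDI
micro-architecture is still being defined.

```python
def add8(a: int, b: int, carry_in: int = 0):
    """One 8-bit execution unit: returns (8-bit sum, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def add_linked(a: int, b: int, units: int = 2):
    """Chain `units` 8-bit adders through their carries to add wider words."""
    result, carry = 0, 0
    for i in range(units):
        # Unit i handles bits 8*i .. 8*i+7 and receives the carry of unit i-1.
        partial, carry = add8(a >> (8 * i), b >> (8 * i), carry)
        result |= partial << (8 * i)
    return result, carry


print(add_linked(0x12FF, 0x0001))   # two linked units act as a 16-bit adder
```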

Because a hardware synthesis and verification environment will be crucial to the success of
fast prototyping, this project also involves developing software tools to map a system definition
into the hardware macrocomponent library and automatically generating functional and electrical
simulation models.


2.19.  HYPER  -  An  Interactive  Synthesis  Environment  for  High-Performance  Real-Time 
Applications  (J.M.  Rabaey) 

The goal of this project is to build an interactive environment for the design of
high-performance real-time processors, consisting of the following major elements:

(1) The algorithm to be implemented must be specified either graphically or textually. We
have extended Silage [1] to describe control constructs for register-level descriptions and
developed a graphic front-end based on VEM/RPC.

(2) From the input specification, a mixed signal-flow/control-flow graph is derived. The
undecorated flow graph, generated by the parser, is then passed to a scheduling and optimization
pass. These techniques are described in [10].

(3) From the resulting scheduled graph, we can synthesize the datapaths, the controller, and the
interface logic. All these steps require accurate information about the available cell library
(speed, area, black-box view, and functionality); this information is provided by a rule-based
library database. The final layouts of the processors are generated with the LagerIV
system. A program that translates a decorated flow-graph description to the SDL (structure
description language) has been developed.

The hardware mapping process consists of three routines that translate the datapaths, the
controller, and the interface logic. The hardware mapping of datapaths requires a suite of
transformation steps, including multiplexer reduction, hardware choices of assign operations, and
datapath partitioning. The details of these operations can be found in [3]. Several translation
steps are also introduced to the system to handle signal broadcasting and arithmetic commutativity.

Datapath partitioning is based on three criteria. First, a depth-first search is performed
through the hardware graph. Each group is further divided if different word lengths are found
within the group. If the number of blocks in a group is still too large, a simulated annealing-based
algorithm is used for solving the min-cut problem.
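
The first two criteria can be sketched in Python as shown below. The block names, the
representation of the hardware graph as a dictionary plus a wire list, and the function name are
hypothetical, and the third criterion (the annealing-based min-cut for oversized groups) is
deliberately left out of this sketch.

```python
from collections import defaultdict


def partition_datapath(blocks, wires):
    """blocks: {name: word_length}; wires: iterable of (name, name) connections."""
    adjacency = defaultdict(set)
    for a, b in wires:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Criterion 1: a depth-first search collects blocks that are connected.
    seen, groups = set(), []
    for start in blocks:
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(group)

    # Criterion 2: split every group by word length.
    result = []
    for group in groups:
        by_width = defaultdict(list)
        for name in group:
            by_width[blocks[name]].append(name)
        result.extend(sorted(names) for _, names in sorted(by_width.items()))
    return result


blocks = {"add0": 16, "mul0": 16, "reg0": 8, "sh0": 16}
wires = [("add0", "mul0"), ("mul0", "reg0")]
print(partition_datapath(blocks, wires))
```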

The control path of a processor can also be derived from the decorated flow-graph. First, a
state-transition diagram is generated from the scheduling information; then the transition diagram
is optimized by removing the dummy states. The controller structure is generated in this step.
Several optimizations are performed to reduce the size of the controller and to simplify the wiring
between the control path and the datapath: these include recognizing the control signals that are
independent of control states, merging equivalent or complementary signals, and allocating the
minimum number of control registers by lifetime analysis.
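
Register allocation by lifetime analysis can be illustrated with the classic left-edge style
greedy assignment below. This is a generic technique offered as a sketch, not necessarily the
algorithm implemented in HYPER; the signal names and the half-open lifetime convention are
assumptions.

```python
def allocate_registers(lifetimes):
    """Assign signals to registers so that overlapping lifetimes never share one.

    lifetimes: {signal: (first_state, last_state)}, live over [first, last).
    Returns a list of registers, each a list of the signals that share it.
    """
    registers = []   # signals assigned to each register
    free_at = []     # control state at which each register becomes free again
    for name, (start, end) in sorted(lifetimes.items(), key=lambda item: item[1]):
        for index, free in enumerate(free_at):
            if free <= start:            # reuse the first register that is free
                registers[index].append(name)
                free_at[index] = end
                break
        else:                            # no register is free: allocate a new one
            registers.append([name])
            free_at.append(end)
    return registers


lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
print(allocate_registers(lifetimes))
```

Processing the signals in order of their start state guarantees that the number of registers
equals the largest number of signals alive at the same time, which is the minimum possible.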

Several extensions of the system have been made since the last proposal. First, control slices
between datapaths and central control are partitioned according to the partitioning of the datapaths.
This greatly improved the layout quality. Logic optimization is also performed on the
control slices to reduce redundant logic. Another extension is that bus structures of control signals
are now allowed, in order to handle cases such as log shifters and register files.

We have been able to generate the layouts of a 7th-order IIR filter from a Silage description
and use THOR to simulate the functionality of the structure description generated by HYPER.
Several examples, including the epsilon processor and the Viterbi processor of our Hidden Markov
Model (HMM) speech projects, have also been generated from the flow-graph description.
We can easily study the tradeoffs of different scheduling and hardware allocation schemes by
using HYPER. Future research will address the estimation and hardware allocation schemes of
HYPER, introduce more translation and transformation routines, and rewrite the database
system in OCT to improve performance.


2.20.  High-Quality  Speech  Coding  for  Portable  Communications  (J.M.  Rabaey) 

This research focuses on the evaluation of high-quality speech coding algorithms and will
result in an implementation for use in a portable communications system. The evaluation phase
consists of analysis and simulation of several candidate algorithms. High-quality and relatively
low bit-rate (~16 Kb/s) operation suggest a robust CELP (Code-Excited Linear Prediction)
algorithm. In CELP coding, a search algorithm periodically selects an optimum excitation sequence
from a vector-quantized code-book. An index to this vector is then transmitted and used to excite
a synthesis filter in the decoder. The decoder filter parameters are then updated in a
backward-adaptive fashion.
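
The code-book search can be sketched as follows. This is a minimal analysis-by-synthesis loop under simplifying assumptions, not the coder under study: it omits perceptual weighting, filter memory between frames, and the backward adaptation of the filter. The function names, the toy code-book, and the single filter coefficient are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole synthesis filter: y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y


def search_codebook(target, codebook, a):
    """Return (index, gain) of the code vector whose filtered and
    gain-scaled version is closest to `target` in squared error."""
    best_index, best_gain, best_err = None, 0.0, float("inf")
    for i, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue                      # an all-zero vector cannot match anything
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_gain, best_err = i, gain, err
    return best_index, best_gain


# Toy example: the target is code vector 1 at gain 2, so the search should find it.
codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 0, 0]]
coeffs = [-0.5]
target = [2 * v for v in synthesize(codebook[1], coeffs)]
index, gain = search_codebook(target, codebook, coeffs)
```

Only `index` (and, depending on the coder, a quantized gain) would be transmitted; the decoder runs the same synthesis filter on the selected vector.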

Since this coder must function in a portable environment, low power dissipation is an
overriding concern. Suitability for a low-power, real-time implementation will undoubtedly
influence algorithm selection and customization. Indeed, much of this study will focus on
analyzing and developing VLSI architectures and design styles suitable for low-power operation.
This analysis, based on fundamentals of complexity theory, statistical analysis, and simulation,
should lead to design styles and methodologies suitable for low-power implementations. The
research is relevant not only for this particular application, but also for a wide range of DSP
algorithms constrained to low-power operation.


2.21. Partitioning DSP Algorithms onto Multiprocessors with Configurable Interconnection
(J.M. Rabaey)

Multiple programmable digital processors are used to handle the computationally intensive
behavioral simulation of DSP algorithms. To minimize the communication overhead, the processors
are connected by a bus that can be configured by software to match the communication pattern
of each particular algorithm. To minimize the idling time, an efficient heuristic to balance
processor loads is being investigated. This algorithm uses estimates of the computation time and
communication time of the individual atomic tasks to optimally partition the program onto the
processors. First, pipelining and retiming at the block level will be performed. Then, within the
sub-block, parallelism or further pipelining can be used to exploit the finer grain concurrency.
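
To make the load-balancing idea concrete, the sketch below uses the textbook longest-processing-time-first greedy rule as a stand-in; it is not the heuristic under investigation, and it ignores communication time, pipelining, and retiming. The task names and time estimates are hypothetical.

```python
import heapq


def balance_loads(task_times, num_processors):
    """Greedy longest-processing-time-first assignment.

    task_times: dict task -> estimated computation time.
    Returns (assignment dict task -> processor, list of processor loads).
    """
    heap = [(0.0, p) for p in range(num_processors)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = {}
    # Place the longest tasks first, each on the least-loaded processor.
    for task, t in sorted(task_times.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        assignment[task] = p
        heapq.heappush(heap, (load + t, p))
    loads = [0.0] * num_processors
    for task, p in assignment.items():
        loads[p] += task_times[task]
    return assignment, loads


# Hypothetical atomic tasks with estimated execution times.
tasks = {"fir": 7, "fft": 5, "agc": 4, "mix": 3, "dec": 3}
assignment, loads = balance_loads(tasks, 2)
```

Here the greedy rule yields loads of 10 and 12, whereas a perfect split (11 and 11) exists. Gaps like this are one reason a search-based partitioner that also accounts for communication cost is attractive.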

The  ultimate  goal  of  this  project  is  to  provide  an  environment  where  users  can  express  DSP 
algorithms  textually  or  graphically,  and  have  a  compiler  partition  and  translate  the  program  into 
sections  of  assembly  code  to  be  executed  on  the  multiprocessor  system.  The  compiler  will 
configure  the  bus  to  minimize  the  communication  overhead. 

A prototype of the multiprocessor system (SMART) has been built. It consists of 10
DSP32C Digital Signal Processors as well as custom VLSI chips to handle the communication
and synchronization between processors.

The Silage To SMART (S2S) Compiler, which implements the partitioning algorithm, is
under development. It is composed of 4 tasks:

1  Silage to Flowgraph (S2F) Translation

2  Flowgraph Partitioning

3  Flowgraph to C (FF2C) Translation

4  C to DSP32C Code Compilation

The Silage to flowgraph translation is complete. The flowgraph partitioning performs
pipelining, retiming, and parallelism simultaneously under one global search strategy. The
algorithm also automatically breaks the nodes of the graph down to the appropriate level of
granularity while partitioning. Initial results from the flowgraph partitioning are very good.
Further research is needed in modeling communication costs between processors accurately, as
well as in handling multirate DSP applications.

The first version of the flowgraph to C translation has been implemented; the code it
generates can execute on Sun workstations. The ultimate goal is to use AT&T’s C compiler for the
DSP32C to generate code for the processors. To yield efficient code, an in-depth study of the
characteristics of the C compiler will be done.

2.22. SMART: Switchable Multiprocessor Architecture with Real-Time Support (J.M.
Rabaey)

A major part of the design effort for DSP systems is devoted to algorithmic specification
and verification. Executing these simulations on a general-purpose computer requires so much
CPU-time that real-time simulations of certain algorithms are impossible. The main purpose of
this project is to develop a dedicated simulation engine at least two orders of magnitude faster
than a general-purpose computer architecture of the same technology level. To handle the
number-crunching bottleneck, we are using a floating-point DSP processor (the DSP32C from
AT&T Bell Laboratories) as the core processor. On this processor, simulations run at least an
order of magnitude faster than on a general-purpose microprocessor.

An additional order-of-magnitude gain in simulation speed can be obtained by exploiting
the high degree of parallelism and pipelining inherent in most signal-processing algorithms. We
have proposed a multiprocessor architecture with a software-reconfigurable communication
pattern that permits the processor architecture to be adjusted to match the concurrency and
pipelining properties of the algorithm.

Two 208-pin VLSI chips have been designed and tested to handle communication and
synchronization between the processors and to manage memory access. A prototype of the system is
implemented and operational at a peak of 120 MFLOPS with 10 processing units. Application
software programs such as diagnostic testing, image processing, and a 1024-point FFT were
developed to measure the performance of the architecture. Future efforts are directed to the
development of more application programs, the custom high-speed I/O interface, and the
performance enhancement of the prototype system (200 MFLOPS) for the same number of processors.


2.23.  Extended  THOR  (J.M.  Rabaey) 

THOR is a functional simulator based on the CSIM simulator, a conventional event-driven
functional/behavioral simulator. To describe a system in the THOR environment, the user has to
provide the models of the system modules and their interconnections. The modules are written in
a language called CHDL (C Hardware Description Language). CHDL is based on the C programming
language with added features for hardware modeling. The interconnection network
description is provided in a component-oriented net list language, called CSL.
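
The event-driven principle described above can be sketched in a few lines. This is a generic illustration, not THOR, CHDL, or CSL: the `Simulator` class, the `inverter` model, and the net names are all hypothetical, and modules are plain Python callables rather than CHDL models.

```python
import heapq


class Simulator:
    """Minimal event-driven kernel: nets carry values, modules react to changes."""

    def __init__(self):
        self.queue = []    # pending events: (time, sequence number, net, value)
        self.values = {}   # net name -> current value
        self.fanout = {}   # net name -> modules sensitive to that net
        self.seq = 0       # tie-breaker that keeps same-time events in order
        self.now = 0

    def connect(self, net, module):
        self.fanout.setdefault(net, []).append(module)

    def schedule(self, delay, net, value):
        heapq.heappush(self.queue, (self.now + delay, self.seq, net, value))
        self.seq += 1

    def run(self):
        while self.queue:
            self.now, _, net, value = heapq.heappop(self.queue)
            if self.values.get(net) == value:
                continue                   # value unchanged: nothing to propagate
            self.values[net] = value
            for module in self.fanout.get(net, []):
                module(self)               # re-evaluate every module on this net


def inverter(inp, out, delay):
    """Return a module model that drives `out` with NOT `inp` after `delay`."""
    def evaluate(sim):
        sim.schedule(delay, out, 1 - sim.values[inp])
    return evaluate


# Netlist: a -> inverter (delay 2) -> b -> inverter (delay 3) -> c
sim = Simulator()
sim.connect("a", inverter("a", "b", 2))
sim.connect("b", inverter("b", "c", 3))
sim.schedule(0, "a", 1)
sim.run()
```

After `run()` returns, net `c` has settled to 1 at simulation time 5. Modules are evaluated only when one of their input nets changes, which is the essence of the event-driven protocol.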

THOR implements some of the features that are essential in a heterogeneous simulation
environment:

Since the leaf modules in THOR are described in standard C, it is possible to link in a
variety of "foreign" or external simulators, such as a SILAGE or a Motorola 68000 simulator.

The THOR simulation engine handles the intermodule communications. Unfortunately, the
present THOR is based solely on the event-driven protocol, which is sufficient for structural
simulation. As described above, however, a system-level simulation environment equally
requires data-flow and communication-process based mechanisms.

The LAGER silicon assembly system makes extensive use of the THOR environment for
modeling and simulation at the structural level. An extension of the THOR environment, which
will have most of the properties required for heterogeneous simulation, has therefore become
essential.

The CSL language will be replaced by the sdl/OCT-SIV descriptions of the LAGER
environment. This will improve the parameterizability and the descriptive power.

The introduction of foreign simulators will be simplified by providing a library of interface
routines. This will make the interconnection mechanisms transparent to the user.

The simulation engine will be adapted to incorporate a variety of communication mechanisms.

The extended THOR will handle multi-processor simulation and hard simulation. This will
once again be achieved by providing a set of library routines, which hide the physical
implementation of those interfaces from the user.


2.24. Behavioral Transformations for the Synthesis of High-Performance DSP Systems
(J.M. Rabaey)

To solve a given computational problem, one can use a large number of algorithms. Often
any one of these algorithms can lead to several implementations, each with vastly different
execution time, hardware requirements, power constraints, and testability. Since the flow graph
specified by the designer often fails to meet the performance specifications or results in an
inferior realization, optimizing transformations must be applied. Most of the behavioral
transformations are well known from optimizing software compilers; they include constant
arithmetic, common subexpression elimination, and dead-code elimination. More important are the
loop transformations: loop retiming, loop pipelining, partial or complete loop unrolling, and loop
jamming. These latter transformations are especially suitable for real-time systems, in which
each program contains an infinite loop over time and concurrency can be exploited more
efficiently by controlling the hardware resources.
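
The three compiler-style transformations named above can be sketched on a tiny flow graph. This is an illustrative sketch only, not the transformation environment described in this section; the node format `(dest, op, a, b)`, the function name `optimize`, and the example graph are hypothetical. The loop transformations are not shown.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def optimize(nodes, outputs):
    """nodes: list of (dest, op, a, b) in topological order; an operand is
    either the name of an input or earlier dest, or a numeric constant.
    Applies constant arithmetic, common-subexpression elimination, and
    dead-code elimination. Returns (remaining nodes, output name mapping)."""
    alias = {}   # eliminated dest -> constant value or surviving dest
    seen = {}    # (op, a, b) -> dest that already computes this expression
    kept = []
    for dest, op, a, b in nodes:
        a, b = alias.get(a, a), alias.get(b, b)
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            alias[dest] = OPS[op](a, b)      # constant arithmetic
        elif (op, a, b) in seen:
            alias[dest] = seen[(op, a, b)]   # common subexpression
        else:
            seen[(op, a, b)] = dest
            kept.append((dest, op, a, b))
    # Dead-code elimination: sweep backwards from the outputs.
    live = {alias.get(o, o) for o in outputs}
    result = []
    for dest, op, a, b in reversed(kept):
        if dest in live:
            result.append((dest, op, a, b))
            live.update(x for x in (a, b) if isinstance(x, str))
    result.reverse()
    return result, {o: alias.get(o, o) for o in outputs}


graph = [
    ("t1", "*", 2, 3),        # constant: folds to 6
    ("t2", "+", "x", "y"),
    ("t3", "+", "x", "y"),    # same expression as t2
    ("t4", "*", "t2", "t1"),
    ("t5", "*", "t3", "t1"),  # becomes identical to t4 after the steps above
    ("t6", "-", "x", "y"),    # never used by an output: dead code
]
optimized, out_map = optimize(graph, ["t5"])
```

Six operations shrink to two, and the output `t5` is now produced by node `t4`. Each eliminated node is one less operation to schedule and allocate hardware for.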

We  are  implementing  a  search-driven  transformation  environment  where  the  order  and  type 
of  the  transformations  are  determined  by  the  table  of  hardware-utilization  ratios.  The  same 
environment  can  also  support  other  high-level  synthesis  tasks  such  as  module  and  clock  selection, 
partitioning,  pipelining,  design-style  selection,  assignment,  and  scheduling. 

We began the implementation with the basic-block transformations. These include
commutativity, associativity, distributivity, retiming, constant evaluation, pipelining, and
software pipelining. We developed a new algorithm, based on the Welsh randomized algorithm,
which applies the transformations just mentioned in such a way that the resulting signal flow
graph can be scheduled in a minimum amount of time on a given hardware configuration. The first
results (we have completed the implementation of commutativity) are very promising.


2.25. Scheduling and Resource Allocation in the Design of High-Performance Digital Signal
Processors (J.M. Rabaey)

The goal of this project is to minimize the total hardware cost of an implementation of a
target program represented by a signal-flow graph (or data-dependency graph), given constraints
on execution time, timing, and hardware. The hardware cost function is composed of the cost of
the functional units, memory, and connectivity.

An overview of the state of the art in this area is given in [1]. None of the available
approaches, however, allows a consistent and unbiased treatment of the three contributions to the
cost function: execution units, memory, and interconnect. Furthermore, most available
techniques remove all hierarchies from the flow graph before scheduling: functions are expanded,
and loops are unrolled, resulting in huge graphs for most problems. We have been developing a
suite of algorithms that address these deficiencies.

The new technique has four parts: control strategy, estimation, assignment, and generation
of the exact solution (scheduling).

We finished implementation of the new assignment and scheduling algorithm. From the
algorithmic point of view there are three new features: (i) it is a combination of probabilistic
and classical constructive algorithms; (ii) the probabilistic algorithm is rejectionless; and
(iii) the classical constructive algorithm is based on the idea of discrete relaxation. A simple
extension makes the incorporation of a specific transformation (commutativity) simple and
efficient.
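
For readers unfamiliar with the scheduling problem itself, the sketch below shows a plain deterministic list scheduler. It is a deliberately simpler stand-in, not the probabilistic and discrete-relaxation algorithm described above. It assumes every operation takes one control step and that the graph is acyclic; the function name, operation names, and unit counts are hypothetical.

```python
def list_schedule(ops, deps, units):
    """Resource-constrained list scheduling with unit-latency operations.

    ops:   dict operation -> unit type it needs.
    deps:  dict operation -> list of predecessor operations.
    units: dict unit type -> number of units available per control step.
    Returns dict operation -> control step (0-based).
    """
    for kind in ops.values():
        if units.get(kind, 0) < 1:
            raise ValueError("no unit of type %r available" % kind)
    step_of = {}
    remaining = set(ops)
    step = 0
    while remaining:
        free = dict(units)
        # Ready: every predecessor finished in an earlier control step.
        ready = sorted(
            n for n in remaining
            if all(p in step_of and step_of[p] < step for p in deps.get(n, []))
        )
        if not ready:
            raise ValueError("cyclic dependencies")
        for n in ready:
            if free[ops[n]] > 0:       # claim a unit if one is still free
                free[ops[n]] -= 1
                step_of[n] = step
        remaining -= set(step_of)
        step += 1
    return step_of


# Hypothetical flow graph: a2 = (m1 + m2) + m3, with three multiplications.
ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
deps = {"a1": ["m1", "m2"], "a2": ["a1", "m3"]}
one_multiplier = list_schedule(ops, deps, {"mul": 1, "add": 1})
two_multipliers = list_schedule(ops, deps, {"mul": 2, "add": 1})
```

With one multiplier the graph needs four control steps; a second multiplier shortens it to three. This is the kind of cost versus execution-time tradeoff the project sets out to optimize.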

We show excellent results on a number of examples. Running time for examples with up to
a hundred nodes is less than 1 second. We introduce a broad class of test examples and the idea of
estimation, which makes assessment of scheduling and assignment algorithms easier and more
realistic. We also demonstrated on those examples a simple and efficient approach to trading
the speed of scheduling for the quality of the result.


2.26.  A  Hardware  Environment  for  Rapid  Prototyping  of  DSP  Systems  (J.M.  Rabaey) 

The main goal of this project is to provide a rapid prototyping hardware environment for
designers of DSP systems. The key part of the system that we envision is a dynamically
configurable network with high communication bandwidth. In addition to data routing, the
network also supports features such as broadcasting and merging data.

During the early phase of the design cycle, designers can simulate their algorithms by
plugging processor boards into our system. The network integrates these processors together to
form a multi-processor simulation engine which provides high computation throughput as required
by many DSP algorithms. The network can be configured according to the communication patterns
of the algorithms to reduce overheads due to inter-processor communication. Heterogeneous
processors, e.g., DSPs and RISCs, can be used to suit the nature of the computation. As the system
evolves, ASICs replace certain sub-systems where high performance is desired, while the rest of
the system, such as some front-end processing, can still be emulated by the processors.

Our system shortens design time by providing high computation throughput and
programmability throughout the development cycle so that designers can get fast feedback on
performance and hardware problems. It also eases the integration of custom-designed hardware into
the final system.

Problems that we have to address include system extensibility, I/O requirements of the
system, standardization of the bus protocol, and the design of the communication network.


2.27. Pulsar Signal Recovery (J.M. Rabaey)


Pulsars are rotating, highly magnetized neutron stars which emit sharp pulses with high
stability of frequency. Pulsar timing has a number of applications. However, to achieve pulsar
timing, the signals, which have been dispersed by propagation through the interstellar medium,
have to be passed through a de-dispersion filter at the receiver end. The coherent de-dispersion
technique involves implementing the inverse of the interstellar medium transfer function in an FIR
filter. However, de-dispersing the entire 100 MHz bandwidth of the signal would require an FIR
filter on the order of a million taps.

Since such a huge FIR filter is not feasible, the current research aims at a 1000-tap FIR
filter which would de-disperse about 1 MHz of the signal bandwidth. Currently, we are carrying
out simulations of word-length effects on the signal recovery. Once the word lengths are decided
upon, the de-dispersion filter will be converted to a VLSI chip.
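
A word-length study of this kind can be sketched as follows. The sketch quantizes only the filter coefficients and compares the output against a full-precision reference. The 16-tap filter and the sinusoidal input are hypothetical stand-ins, not the actual de-dispersion filter or pulsar data, and data-path word lengths are not modeled.

```python
import math


def quantize(x, bits):
    """Round x in [-1, 1) to a signed fixed-point value with `bits` bits."""
    scale = 2 ** (bits - 1)
    q = max(-scale, min(scale - 1, round(x * scale)))  # saturate at the range ends
    return q / scale


def fir(signal, taps):
    """Direct-form FIR filter with zero initial state."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def snr_db(reference, test):
    """Signal-to-noise ratio of `test` relative to `reference`, in dB."""
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    power = sum(r * r for r in reference)
    return float("inf") if noise == 0 else 10 * math.log10(power / noise)


# Hypothetical chirp-like 16-tap filter and test input, for illustration only.
taps = [0.9 * math.cos(0.05 * k * k) / 16 for k in range(16)]
signal = [math.sin(0.3 * n) for n in range(256)]
reference = fir(signal, taps)
snr_by_bits = {
    bits: snr_db(reference, fir(signal, [quantize(h, bits) for h in taps]))
    for bits in (6, 8, 12, 16)
}
```

As a rule of thumb, each additional coefficient bit should buy roughly 6 dB of signal-to-noise ratio. Tabulating `snr_by_bits` for the real filter would show the shortest word length that still meets the recovery requirement.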


2.28.  Frigg:  A  Simulation  Environment  for  Multiple-Processor  DSP  Hardware  Develop¬ 
ment  (E.A.  Lee) 

In the last six months we completed and terminated this project, the construction of design
aids for systems using commodity programmable DSPs. We view this as a promising start on
better integrated design environments that incorporate heterogeneous models of computation. In
this case, we have interfaced the hardware design environment to Gabriel, which synthesizes
assembly code for programmable DSPs. So the designer’s view of the system simultaneously
incorporates a dataflow representation of the signal processing application and a netlist
representation of the hardware being designed to implement that application.

Many practical DSP problems require systems with multiple programmable DSPs.
Developing multi-processor systems is complicated, and traditional development tools do not
provide adequate support. While simulators exist for all programmable DSPs, there has been no
ready way for developers to simulate the interaction between a programmable DSP and other
digital hardware, or between multiple (possibly heterogeneous) DSPs. This lack of flexible
simulation capabilities has often meant long delays in hardware development and the postponement
of software testing and debugging until after hardware prototypes are built.

We have designed and implemented a simulation environment that meets the needs of
developers of multi-DSP systems. Our simulator, Frigg, uses the Thor simulator from Stanford
University as the simulation substrate. Thor is used to simulate all hardware elements other than
DSPs, and to simulate all interconnections between elements. Frigg integrates the capabilities of
Thor with those of manufacturer-supplied DSP simulators. The X window system is used to provide
a multi-window user interface which allows a user to easily select and interact with different
processing elements. Frigg uses the interprocess communications (IPC) facilities of Unix to
interface independent cooperating simulation processes, which can run on different hosts. The
communication mechanism developed is general, and may be used in the future to interconnect a
wide variety of simulators in a multi-mode simulation environment.

Frigg has been interfaced to Gabriel, a graphical data-flow programming system (with
arbitrary granularity) for multiple programmable DSP systems. A designer can create a new
architecture by specifying the Thor models and creating the netlist. Then those properties of the
architecture that are relevant to code generation, such as the number of processors and the
interprocessor communication mechanism, are summarized in a form readable to Gabriel. Gabriel can
then take any application program and produce code for the new architecture. An attempt has been
made to make the Gabriel core as independent as possible of details of the target architecture,
but more work is warranted to get truly easy retargetability.

Finally, we have assembled a laboratory facility for experimentation with multiprocessor
DSP hardware and software. Currently, a four-processor DSP56000-based system is running in
the lab, and has been integrated with other lab resources. The prototype system was donated by
Dolby Laboratories, a San Francisco company. The system has been interfaced to Gabriel, using
the hardware description mechanism described above. We have demonstrated that Gabriel can
automatically synthesize code for the system without requiring that the user have any
architecture-specific knowledge.


2.29.  Heterogeneous  Hardware  Targets  (E.A.  Lee) 

The Gabriel software system is capable of running local simulations of a signal processing
system on a workstation, or generating assembly code for real-time execution. Unfortunately, at
this time, the designer must choose between these two modes of operation, rather than a more
natural mode that would allow a mixed system that involves interaction between the workstation
and the real-time DSP hardware. The next generation software environment (called Ptolemy,
described separately) will support such mixed systems naturally by supporting cooperation
between distinct scheduling models. The aim of this project is to determine how to use this new
capability to automatically target heterogeneous hardware architectures. The first such target
will be a workstation with a programmable DSP-based board on its bus. An application program will
simultaneously specify the activity in the workstation and on the DSP board, and the interaction
between the two, but to the system designer the interface will appear seamless. To begin this
effort we have developed an interface between a VME-based development system for the
Motorola DSP56001 and a Sun 3 system. The interface has been developed using Gabriel pending
sufficient capability in its replacement system. The limitations of the VME bus and Sun 3
workstation restrict the real-time transfer rate of our interface to somewhat less than 20,000
samples per second. To proceed to the next logical step, therefore, we have cooperated with a
small local company that is producing a DSP56001 board that will reside on the S-bus of a Sun
Sparc workstation. We are developing the appropriate device driver and will begin to design
closely interacting applications that use both computation resources, the DSP and the workstation.


2.30.  Ptolemy:  A  Non-Dogmatic  Third  Generation  Simulation  Environment  (E.A.  Lee  and 
D.G.  Messerschmitt) 


We have implemented and had considerable user experience with two generations of DSP
simulation environments, Blosim and Gabriel. Both environments used a block-diagram dataflow
paradigm for the description of the algorithms. For the future we see the need for other
computational models, such as discrete-event scheduling, mixed compile-time and run-time
scheduling, or computational models based on shared-memory data structures. These are not
supported very gracefully by Blosim or Gabriel. Most importantly, we see the need for a flexible
simulation environment which is extensible to new computational models without
re-implementation of the system.

This has led us to begin development of a new environment which uses object-oriented
programming methodology. Our goal is to make it non-dogmatic, in the sense that the environment
itself does not impose any particular computational model, and it is extensible to new models by
simply adding to the system and not modifying what is already there. Further goals are to
incorporate features that have been successful in Blosim or Gabriel, such as achieving modularity
and reusability of user-programmed software modules, friendly graphical window interfaces, and
code generation for target concurrent architectures rather than just simulation.

A  first  version  of  the  system  currently  has  a  synchronous  data-flow  scheduler  and  an 
embryonic  discrete-event  scheduler.  This  system  will  thus  immediately  be  capable  of  simulating 
combinations  of  signal  processing  and  network  simulations  (such  as  in  packet  speech  and  packet 
video)  and  combinations  of  behavioral  and  hardware  simulation.  We  are  currently  working  on 
interfacing  the  system  to  the  same  graphical  interface  used  by  Gabriel,  which  uses  Vem  to  edit 
Oct  facets.  Consequently,  Oct  serves  as  the  design  database  for  the  system,  just  as  with  Gabriel. 
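
The first task of a synchronous data-flow scheduler is to find how often each block must fire per schedule period so that every arc stays balanced. The sketch below solves these balance equations for a connected graph. It illustrates the principle only and is not code from Ptolemy or Gabriel; the function name `repetitions` and the example graph are hypothetical.

```python
from fractions import Fraction
from math import gcd


def repetitions(actors, arcs):
    """Solve the SDF balance equations r[src] * produced == r[dst] * consumed.

    actors: list of actor names (the graph is assumed to be connected).
    arcs:   list of (src, produced, dst, consumed) tuples.
    Returns the smallest positive integer firing counts, or raises
    ValueError if the sample rates are inconsistent.
    """
    rate = {actors[0]: Fraction(1)}   # firing rates relative to the first actor
    pending = [actors[0]]
    while pending:
        a = pending.pop()
        for src, produced, dst, consumed in arcs:
            if src == a:
                other, r = dst, rate[a] * produced / consumed
            elif dst == a:
                other, r = src, rate[a] * consumed / produced
            else:
                continue
            if other not in rate:
                rate[other] = r
                pending.append(other)
            elif rate[other] != r:
                raise ValueError("inconsistent sample rates")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational rates to the smallest integer solution.
    lcm = 1
    for r in rate.values():
        lcm = lcm * r.denominator // gcd(lcm, r.denominator)
    counts = {a: int(r * lcm) for a, r in rate.items()}
    common = 0
    for n in counts.values():
        common = gcd(common, n)
    return {a: n // common for a, n in counts.items()}


# A produces 2 tokens per firing and B consumes 3; B produces 1 and C consumes 2.
firings = repetitions(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)])
```

Here A fires three times, B twice, and C once per period. Because these counts are known before execution, a compile-time schedule can be constructed, in contrast to the run-time event queue that a discrete-event scheduler needs.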